Code comment fix
When table oids were removed by commit 578b229718e, the function CatalogTupleInsert() was modified to return void but its comment was left intact. Here is a trivial patch to fix that. -- Vik Fearing

diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 284ceaa6b9..4d1440cd3a 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -210,7 +210,6 @@ CatalogTupleCheckConstraints(Relation heapRel, HeapTuple tup)
  * CatalogTupleInsert - do heap and indexing work for a new catalog tuple
  *
  * Insert the tuple data in "tup" into the specified catalog relation.
- * The Oid of the inserted tuple is returned.
  *
  * This is a convenience routine for the common case of inserting a single
  * tuple in a system catalog; it inserts a new heap tuple, keeping indexes
RE: [PATCH] pgbench: improve \sleep meta command
Dear Fujii-san, Thank you for updating the patch. I understand that you don't want to change the current specification.

```diff
+	if (usec == 0)
+	{
+		char	   *c = var;
+
+		/* Skip sign */
+		if (*c == '+' || *c == '-')
+			c++;
```

In my understanding the skip is not necessary, because the plus sign has already been removed in executeMetaCommand() and a leading minus sign is handled by atoi(). Best Regards, Hayato Kuroda FUJITSU LIMITED
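As a quick standalone check of the point being made here (this is not pgbench code, just a throwaway demonstration), atoi() itself consumes a leading '+' or '-', so nothing needs to skip the sign before calling it:

```c
/*
 * Standalone check (not pgbench code): atoi() consumes a leading '+' or
 * '-' by itself, so no manual sign-skipping is needed before conversion.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	const char *values[] = {"3", "+3", "-3"};

	for (int i = 0; i < 3; i++)
		printf("atoi(\"%s\") = %d\n", values[i], atoi(values[i]));
	return 0;
}
```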
Re: PITR promote bug: Checkpointer writes to older timeline
On Mon, Mar 15, 2021 at 04:38:08PM +0900, Michael Paquier wrote:
> On Mon, Mar 15, 2021 at 03:01:09PM +0900, Kyotaro Horiguchi wrote:
>> Logical decoding stuff is (I think) designed to turn any backend into a walsender, which may need to maintain ThisTimeLineID. It seems to me that logical decoding stuff intends to maintain ThisTimeLineID of such backends when reading a WAL record. logical_read_xlog_page also updates ThisTimeLineID, and pg_logical_slot_get_changes_guts(), pg_replication_slot_advance() (and maybe other functions) update ThisTimeLineID. So it is natural that read_local_xlog_page() updates it, since it is intended to be used in logical decoding plugins.
>
> Logical decoding contexts cannot be created while in recovery as per CheckLogicalDecodingRequirements(), and as mentioned not everything is in place to allow that. FWIW, I think that it is just confusing for pg_replication_slot_advance() and pg_logical_slot_get_changes_guts() to update it, and we could just look for the latest value each time it is necessary when reading a new WAL page.

Studying some history today, having read_local_xlog_page() directly update ThisTimeLineID was extensively discussed here back in 2017, in the attempt to introduce logical decoding on standbys (1148e22a): https://www.postgresql.org/message-id/CAMsr%2BYEVmBJ%3DdyLw%3D%2BkTihmUnGy5_EW4Mig5T0maieg_Zu%3DXCg%40mail.gmail.com

Currently, with HEAD and the back branches, nothing would be broken, as logical contexts cannot exist in recovery. Still, it would be easy for anybody attempting to work more on this feature in the future to miss the new behavior if we change read_local_xlog_page() to not update ThisTimeLineID anymore. Resetting the value of ThisTimeLineID in read_local_xlog_page() does not seem completely right either with this argument, as there could be some custom code relying on the existing behavior of read_local_xlog_page() to maintain ThisTimeLineID.

Hmmm. I am wondering whether the best answer for the moment would not be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as that's the code path involved here that uses read_local_xlog_page(). The state of the code is really confusing on HEAD, and I'd like to think that the best thing we could do in the long term is to make the logical decoding path not rely on ThisTimeLineID at all and decouple all that, putting the code in a state sane enough that we don't finish with similar errors to what is discussed on this thread. That would be work for a different patch though, not for the stable branches. And seeing some slot and advancing functions directly update ThisTimeLineID is no good either... Any thoughts? -- Michael
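For illustration, a rough, untested sketch of the save-and-restore idea floated above (not a proposed patch; the existing reader setup in twophase.c is elided):

```c
/*
 * Rough sketch (untested): save and restore ThisTimeLineID around the WAL
 * read, so that the timeline lookup done by read_local_xlog_page() cannot
 * leak into the checkpointer's later WAL writes.
 */
static void
XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
{
	TimeLineID	save_tli = ThisTimeLineID;

	/* ... existing XLogReaderAllocate() / XLogReadRecord() logic ... */

	/* read_local_xlog_page() may have changed it; put it back */
	ThisTimeLineID = save_tli;
}
```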
Re: Code comment fix
On Wed, Mar 17, 2021 at 08:31:16AM +0100, Vik Fearing wrote: > When table oids were removed by commit 578b229718e, the function > CatalogTupleInsert() was modified to return void but its comment was > left intact. Here is a trivial patch to fix that. Thanks, Vik. Good catch. I'll fix that in a bit. -- Michael
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote: > I also think that there should be a single usable top label, otherwise it will > lead to confusing code and it can be a source of bugs. Okay, that's fine by me. Could it be possible to come up with an approach that does not hijack the namespace list construction and the lookup logic as much as it does now? I get it that the patch is done this way because of the ordering of operations done for the initial ns list creation and the grammar parsing that adds the routine label on top of the root one, but I'd like to believe that there are more solid ways to achieve that, like for example something that decouples the root label and its alias, or something that associates an alias directly with its PLpgSQL_nsitem entry? -- Michael
Re: Transactions involving multiple postgres foreign servers, take 2
Hi, For v35-0007-Prepare-foreign-transactions-at-commit-time.patch : With this commit, the foreign server modified within the transaction marked as 'modified'. transaction marked -> transaction is marked +#define IsForeignTwophaseCommitRequested() \ +(foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED) Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired. +static bool +checkForeignTwophaseCommitRequired(bool local_modified) + if (!ServerSupportTwophaseCommit(fdw_part)) + have_no_twophase = true; ... + if (have_no_twophase) + ereport(ERROR, It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s). Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1). have_no_twophase is no longer needed. Cheers On Tue, Mar 16, 2021 at 8:04 PM Masahiko Sawada wrote: > On Mon, Mar 15, 2021 at 3:55 AM Ibrar Ahmed wrote: > > > > > > > > On Thu, Feb 11, 2021 at 6:25 PM Masahiko Sawada > wrote: > >> > >> On Fri, Feb 5, 2021 at 2:45 PM Masahiko Sawada > wrote: > >> > > >> > On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao < > masao.fu...@oss.nttdata.com> wrote: > >> > > > >> > > > >> > > > >> > > On 2021/01/27 14:08, Masahiko Sawada wrote: > >> > > > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao > >> > > > wrote: > >> > > >> > >> > > >> > >> > > >> You fixed some issues. But maybe you forgot to attach the latest > patches? > >> > > > > >> > > > Yes, I've attached the updated patches. > >> > > > >> > > Thanks for updating the patch! I tried to review 0001 and 0002 as > the self-contained change. > >> > > > >> > > + * An FDW that implements both commit and rollback APIs can > request to register > >> > > + * the foreign transaction by FdwXactRegisterXact() to participate > it to a > >> > > + * group of distributed tranasction. The registered foreign > transactions are > >> > > + * identified by OIDs of server and user. > >> > > > >> > > I'm afraid that the combination of OIDs of server and user is not > unique. IOW, more than one foreign transactions can have the same > combination of OIDs of server and user. For example, the following two > SELECT queries start the different foreign transactions but their user OID > is the same. OID of user mapping should be used instead of OID of user? > >> > > > >> > > CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw; > >> > > CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user > 'postgres'); > >> > > CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user > 'postgres'); > >> > > CREATE TABLE t(i int); > >> > > CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS > (table_name 't'); > >> > > BEGIN; > >> > > SELECT * FROM ft; > >> > > DROP USER MAPPING FOR postgres SERVER loopback ; > >> > > SELECT * FROM ft; > >> > > COMMIT; > >> > > >> > Good catch. I've considered using user mapping OID or a pair of user > >> > mapping OID and server OID as a key of foreign transactions but I > >> > think it also has a problem if an FDW caches the connection by pair of > >> > server OID and user OID whereas the core identifies them by user > >> > mapping OID. For instance, mysql_fdw manages connections by pair of > >> > server OID and user OID. 
> >> > > >> > For example, let's consider the following execution: > >> > > >> > BEGIN; > >> > SET ROLE user_A; > >> > INSERT INTO ft1 VALUES (1); > >> > SET ROLE user_B; > >> > INSERT INTO ft1 VALUES (1); > >> > COMMIT; > >> > > >> > Suppose that an FDW identifies the connections by {server OID, user > >> > OID} and the core GTM identifies the transactions by user mapping OID, > >> > and user_A and user_B use the public user mapping to connect server_X. > >> > In the FDW, there are two connections identified by {user_A, sever_X} > >> > and {user_B, server_X} respectively, and therefore opens two > >> > transactions on each connection, while GTM has only one FdwXact entry > >> > because the two connections refer to the same user mapping OID. As a > >> > result, at the end of the transaction, GTM ends only one foreign > >> > transaction, leaving another one. > >> > > >> > Using user mapping OID seems natural to me but I'm concerned that > >> > changing role in the middle of transaction is likely to happen than > >> > dropping the public user mapping but not sure. We would need to find > >> > more better way. > >> > >> After more thought, I'm inclined to think it's better to identify > >> foreign transactions by user mapping OID. The main reason is, I think > >> FDWs that manages connection caches by pair of user OID and server OID > >> potentially has a problem with the scenario Fujii-san mentioned. If an > >> FDW has to use another user mapping (i.g., connection information) due > >
Re: A new function to wait for the backend exit after termination
On Wed, Mar 17, 2021 at 8:28 AM Kyotaro Horiguchi wrote: > > At Wed, 17 Mar 2021 07:01:39 +0530, Bharath Rupireddy > wrote in > > Attaching v10 patch for further review. > > The time-out mechanism doesn't count remainingtime as expected, > concretely it does the following. > > do { > kill(); > WaitLatch(WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, waittime); > ResetLatch(MyLatch); > remainingtime -= waittime; > } while (remainingtime > 0); > > So, the WaitLatch doesn't consume as much time as the set waittime in > case of latch set. remainingtime reduces faster than the real at the > iteration. WaitLatch can exit without waiting for the waittime duration whenever the MyLatch is set (SetLatch). Now the question is how frequently SetLatch can get called in a backend? For instance, if we keep calling pg_reload_conf in any of the backends in the cluster, then the SetLatch will be called and the timeout in pg_wait_until_termination will be reached fastly. I see that this problem can also exist in case of pg_promote function. Similarly it may exist in other places where we have WaitLatch for timeouts. IMO, the frequency of SetLatch calls may not be that much in real time scenarios. If at all, the latch gets set too frequently, then the terminate and wait functions might timeout earlier. But is it a critical problem to worry about? (IMHO, it's not that critical) If yes, we might as well need to fix it (I don't know how?) in other critical areas like pg_promote? > It wouldn't happen actually but I concern about PID recycling. We can > make sure to get rid of the fear by checking for our BEENTRY instead > of PID. However, it seems to me that some additional function is > needed in pgstat.c so that we can check the realtime value of > PgBackendStatus, which might be too much. The aim of the wait logic is to ensure that the process is gone from the system processes that is why using kill(), not it's entries are gone from the shared memory. > + /* If asked to wait, check whether the timeout value is valid or not. > */ > + if (wait && pid != MyProcPid) > + { > + timeout = PG_GETARG_INT64(2); > + > + if (timeout <= 0) > + ereport(ERROR, > + > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > +errmsg("\"timeout\" must not be > negative or zero"))); > > This means that pg_terminate_backend accepts negative timeouts when > terminating myself, which looks odd. I will change this. > Is there any reason to reject 0 as timeout? Actually, timeout 0 should mean that "don't wait" and we can error out on negative values. Thoughts? > +* Wait only if requested and the termination is successful. Self > +* termination is allowed but waiting is not. > +*/ > + if (wait && pid != MyProcPid && result) > + result = pg_wait_until_termination(pid, timeout); > > Why don't we wait for myself to be terminated? There's no guarantee > that myself will be terminated without failure. (I agree that that is > not so useful, but I think there's no reason not to do so.) We could programmatically allow it to wait in case of self termination and it doesn't make any difference to the user, they would see "Terminating connection due to administrator command" FATAL error. I can remove pid != MyProcPid. > The first suggested signature for pg_terminate_backend() with timeout > was pg_terminate_backend(pid, timeout). The current signature (pid, > wait?, timeout) looks redundant. Maybe the reason for rejecting 0 > astimeout is pg_terminate_backend(pid, true, 0) looks odd but it we > can wait forever in that case (as other features does). 
On the other > hand pg_terminate_backend(pid, false, 100) is apparently odd but this > patch doesn't seem to reject it. If there's no considerable reason > for the current signature, I would suggest that: > > pg_terminate_backend(pid, timeout), where it waits forever if timeout > is zero and waits for the timeout if positive. Negative values are not > accepted.

So, as stated above, how about timeout 0 (the default) meaning "don't wait", negative values raising an error, and a positive value (in milliseconds) meaning that we should wait that long after termination? And for pg_wait_for_backend_termination, should we error out for a timeout of 0 or a negative value? IMO, the above semantics are better than timeout 0 meaning "wait forever". Thoughts?

> + ereport(WARNING, > + (errmsg("could not check the > existence of the backend with PID %d: %m", > + pid))); > + return false; > > I think this is worth ERROR. We can avoid this handling if we look > into PgBackendEntry instead.

I will change it to ERROR. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
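As an illustration of the timing concern Horiguchi-san raised, here is a sketch (not the patch's actual code; the helper name and the 100ms step are illustrative, and the usual backend headers for latches, instr_time and wait events are assumed) of a wait loop that recomputes the remaining time from the clock, so a latch set by an unrelated event such as pg_reload_conf() no longer shortens the effective timeout:

```c
/*
 * Sketch only: measure the time actually spent waiting instead of
 * assuming each WaitLatch() call consumed the full interval.
 */
static bool
wait_until_pid_gone(int pid, int64 timeout_ms)
{
	instr_time	start,
				now;

	INSTR_TIME_SET_CURRENT(start);

	for (;;)
	{
		int64		elapsed_ms;

		/* kill(pid, 0) only probes for the existence of the process */
		if (kill(pid, 0) == -1 && errno == ESRCH)
			return true;		/* process is gone */

		INSTR_TIME_SET_CURRENT(now);
		INSTR_TIME_SUBTRACT(now, start);
		elapsed_ms = (int64) INSTR_TIME_GET_MILLISEC(now);
		if (elapsed_ms >= timeout_ms)
			return false;		/* timed out */

		/* sleep in short steps; an early latch set no longer skews the math */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 Min(100, timeout_ms - elapsed_ms),
						 WAIT_EVENT_PG_SLEEP);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}
```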
Re: Code comment fix
On 3/17/21 9:11 AM, Michael Paquier wrote: > On Wed, Mar 17, 2021 at 08:31:16AM +0100, Vik Fearing wrote: >> When table oids were removed by commit 578b229718e, the function >> CatalogTupleInsert() was modified to return void but its comment was >> left intact. Here is a trivial patch to fix that. > > Thanks, Vik. Good catch. I'll fix that in a bit. Cheers. -- Vik Fearing
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On Wed, Mar 17, 2021 at 8:05 AM torikoshia wrote: > > I have not gone through that thread though. Is there any way we can > > detect those child processes(stats collector, sys logger) that are > > forked by the postmaster from a backend process? Thoughts? > > I couldn't find good ways to do that, and thus I'm now wondering > just changing the message. Changed the message. > >> I'm now wondering if changing the message to something like > >> "PID is not a PostgreSQL backend process". > >> > >> "backend process' is now defined as "Process of an instance > >> which acts on behalf of a client session and handles its > >> requests." in Appendix. > > > > Yeah, that looks good to me. IIUC, we can just change the message from > > "PID is not a PostgreSQL server process" to "PID is not a > > PostgreSQL backend process" and we don't need look for AuxiliaryProcs > > or PostmasterPid. > > Changing log messages can affect operations, especially when people > monitor the log message strings, but improving "PID is not a > PostgreSQL server process" does not seem to cause such problems. Now the error message clearly says that the given pid is not a backend process. Attaching v2 patch. Please have a look. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: A new function to wait for the backend exit after termination
Hi, w.r.t. WaitLatch(), if its return value is WL_TIMEOUT, we know the specified timeout has elapsed. It seems WaitLatch() can be enhanced to also return the actual duration of the wait. This way, the caller can utilize the duration directly. As for other places where WaitLatch() is called, similar change can be applied on a per-case basis (with separate patches, not under this topic). Cheers On Wed, Mar 17, 2021 at 2:04 AM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > On Wed, Mar 17, 2021 at 8:28 AM Kyotaro Horiguchi > wrote: > > > > At Wed, 17 Mar 2021 07:01:39 +0530, Bharath Rupireddy < > bharath.rupireddyforpostg...@gmail.com> wrote in > > > Attaching v10 patch for further review. > > > > The time-out mechanism doesn't count remainingtime as expected, > > concretely it does the following. > > > > do { > > kill(); > > WaitLatch(WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, waittime); > > ResetLatch(MyLatch); > > remainingtime -= waittime; > > } while (remainingtime > 0); > > > > So, the WaitLatch doesn't consume as much time as the set waittime in > > case of latch set. remainingtime reduces faster than the real at the > > iteration. > > WaitLatch can exit without waiting for the waittime duration whenever > the MyLatch is set (SetLatch). Now the question is how frequently > SetLatch can get called in a backend? For instance, if we keep calling > pg_reload_conf in any of the backends in the cluster, then the > SetLatch will be called and the timeout in pg_wait_until_termination > will be reached fastly. I see that this problem can also exist in > case of pg_promote function. Similarly it may exist in other places > where we have WaitLatch for timeouts. > > IMO, the frequency of SetLatch calls may not be that much in real time > scenarios. If at all, the latch gets set too frequently, then the > terminate and wait functions might timeout earlier. But is it a > critical problem to worry about? (IMHO, it's not that critical) If > yes, we might as well need to fix it (I don't know how?) in other > critical areas like pg_promote? > > > It wouldn't happen actually but I concern about PID recycling. We can > > make sure to get rid of the fear by checking for our BEENTRY instead > > of PID. However, it seems to me that some additional function is > > needed in pgstat.c so that we can check the realtime value of > > PgBackendStatus, which might be too much. > > The aim of the wait logic is to ensure that the process is gone from > the system processes that is why using kill(), not it's entries are > gone from the shared memory. > > > + /* If asked to wait, check whether the timeout value is valid or > not. */ > > + if (wait && pid != MyProcPid) > > + { > > + timeout = PG_GETARG_INT64(2); > > + > > + if (timeout <= 0) > > + ereport(ERROR, > > + > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > > +errmsg("\"timeout\" must not be > negative or zero"))); > > > > This means that pg_terminate_backend accepts negative timeouts when > > terminating myself, which looks odd. > > I will change this. > > > Is there any reason to reject 0 as timeout? > > Actually, timeout 0 should mean that "don't wait" and we can error out > on negative values. Thoughts? > > > +* Wait only if requested and the termination is successful. Self > > +* termination is allowed but waiting is not. > > +*/ > > + if (wait && pid != MyProcPid && result) > > + result = pg_wait_until_termination(pid, timeout); > > > > Why don't we wait for myself to be terminated? 
There's no guarantee > > that myself will be terminated without failure. (I agree that that is > > not so useful, but I think there's no reason not to do so.) > > We could programmatically allow it to wait in case of self termination > and it doesn't make any difference to the user, they would see > "Terminating connection due to administrator command" FATAL error. I > can remove pid != MyProcPid. > > > The first suggested signature for pg_terminate_backend() with timeout > > was pg_terminate_backend(pid, timeout). The current signature (pid, > > wait?, timeout) looks redundant. Maybe the reason for rejecting 0 > > astimeout is pg_terminate_backend(pid, true, 0) looks odd but it we > > can wait forever in that case (as other features does). On the other > > hand pg_terminate_backend(pid, false, 100) is apparently odd but this > > patch doesn't seem to reject it. If there's no considerable reason > > for the current signature, I would suggest that: > > > > pg_terminate_backend(pid, timeout), where it waits forever if timeout > > is zero and waits for the timeout if positive. Negative values are not > > accepted. > > So, as stated above, how about a timeout 0 (which is default) telling > "don't wait", negative error out, a positive milliseconds value > indicating that we should wait after terminati
Re: pgsql: Add libpq pipeline mode support to pgbench
Bonjour Daniel, Ola Alvaro,

Add libpq pipeline mode support to pgbench
New metacommands \startpipeline and \endpipeline allow the user to run queries in libpq pipeline mode.
Author: Daniel Vérité
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/b4e34135-2bd9-4b8a-94ca-27d760da2...@manitou-mail.org

I did not notice that the libpq pipeline mode thread had a pgbench patch attached, otherwise I would have looked at it. Some minor post-commit comments:

For consistency with the existing \if … \endif, ISTM that it could have been named \batch … \endbatch or \pipeline … \endpipeline?

ISTM that the constraint checks (nesting, no \[ag]set) could be added to CheckConditional, which could be renamed to CheckScript. That would allow the checks in the command execution to be simplified to mere Asserts. -- Fabien.
RE: Parallel Inserts in CREATE TABLE AS
> > Seems like v22 patch was failing in cfbot for one of the unstable test > > cases. > > Attaching v23 patch set with modification in 0003 and 0004 patches. No > > changes to 0001 and 0002 patches. Hopefully cfbot will be happy with v23. > > > > Please consider v23 for further review. > Hi, > > I was looking into the latest patch, here are some comments:

A few comments.

1) Executing the following SQL causes an assertion failure.

---sql---
create table data(a int);
insert into data select 1 from generate_series(1,100,1) t;
explain (verbose) create table tt as select a,2 from data;
--

The stack message:

---stack---
TRAP: FailedAssertion("!allow && gather_exists && tuple_cost_opts && !(*tuple_cost_opts & PARALLEL_INSERT_TUP_COST_IGNORED)", File: "execParallel.c", Line: 1872, PID: 1618247)
postgres: houzj postgres [local] EXPLAIN(ExceptionalCondition+0x8b)[0x940f0b]
postgres: houzj postgres [local] EXPLAIN[0x67ba1c]
postgres: houzj postgres [local] EXPLAIN(ExplainOnePlan+0x1c2)[0x605997]
postgres: houzj postgres [local] EXPLAIN[0x605d11]
postgres: houzj postgres [local] EXPLAIN(ExplainOneUtility+0x162)[0x605eb0]
--

In this case, the Gather node has a projection, in which case parallel CTAS is not supported, but we still ignore the cost in the planner. If we plan to detect the projection, we may need to check tlist_same_exprs.

+	if (tlist_same_exprs)
+	{
		ignore_parallel_tuple_cost(root);
+	}
	generate_useful_gather_paths(root, rel, false);

2)

+	 * Parallelize inserts only when the upper Gather node has no projections.
	 */
-	gstate->dest = dest;
+	if (!gstate->ps.ps_ProjInfo)

IMO, it's better to add some comments about why we do not support projection for now. Because not all projections are parallel-unsafe (such as the case in 1)), it would be desirable to support these later.

3)

+	if (IsParallelInsertionAllowed(PARALLEL_INSERT_CMD_CREATE_TABLE_AS,
+								   &parallel_ins_info))
...
	/* plan the query */
	plan = pg_plan_query(query, pstate->p_sourcetext, CURSOR_OPT_PARALLEL_OK, params);
...
+	if (IsParallelInsertionAllowed(PARALLEL_INSERT_CMD_CREATE_TABLE_AS,
+								   &parallel_ins_info))

Currently, the patch calls IsParallelInsertionAllowed() before and after pg_plan_query(). This might lead to the misunderstanding that parallel_ins_info gets changed during pg_plan_query(). Since parallel_ins_info does not get changed in pg_plan_query(), is it possible to add a bool flag (allowed) in parallel_ins_info to avoid the second call of IsParallelInsertionAllowed()?

Best regards, houzj
Re: fdatasync performance problem with large number of DB files
On Wed, Mar 17, 2021 at 11:45 AM Thomas Munro wrote: > > On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao > wrote: > > On 2021/03/16 8:15, Thomas Munro wrote: > > > I don't want to add a hypothetical sync_after_crash=none, because it > > > seems like generally a bad idea. We already have a > > > running-with-scissors mode you could use for that: fsync=off. > > > > I heard that some backup tools sync the database directory when restoring > > it. > > I guess that those who use such tools might want the option to disable such > > startup sync (i.e., sync_after_crash=none) because it's not necessary. > > Hopefully syncfs() will return quickly in that case, without doing any work?

I just quickly reviewed the patch (the code part). It looks good. My only concern or question is about do_syncfs() being handed an fd opened on a symlink - I'm not 100% sure about that.

I think we could consider reviewing and then pushing the syncfs patch now, since the fsync issue really hurts; although there seems to be a better plan for this in the future, the patch may make the sync step in startup much faster. Today I encountered a real case in a production environment for the first time: startup spends more than an hour on the fsync step. I'm pretty sure that the pgdata directory is already synced, and from a pstack of the startup process I saw it slowly open(), fsync() (surely doing nothing) and close() each file one by one; pre_sync_fname() is also a burden in such a scenario. So this issue is really annoying.

Even with the patch, we could discuss further optimizing the special crash recovery scenario where users explicitly know the sync step can be skipped (this scenario is surely not unusual). The syncfs patch could be used for this scenario as well, but the filesystem might be shared by other applications (this is not unusual and happens in my customer's environment, for example), so syncfs() is (probably much) slower than skipping the sync step. -- Paul Guo (Vmware)
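For readers following along, a minimal standalone sketch of what the syncfs() approach boils down to on Linux (the path and error handling below are placeholders, not taken from the patch): one system call flushes every dirty file of the filesystem containing the data directory, instead of an open()/fsync()/close() cycle per file.

```c
/* Minimal Linux-only illustration; syncfs() requires _GNU_SOURCE. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
sync_filesystem_of(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		perror(path);
		return -1;
	}
	if (syncfs(fd) < 0)			/* flush the whole containing filesystem */
		perror("syncfs");
	close(fd);
	return 0;
}

int
main(void)
{
	/* placeholder data directory path */
	return sync_filesystem_of("/var/lib/postgresql/data") != 0;
}
```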
Re: logical replication worker accesses catalogs in error context callback
On Tue, Mar 16, 2021 at 2:21 AM Tom Lane wrote: > > Thanks for pointing to the changes in the commit message. I corrected > > them. Attaching v4 patch set, consider it for further review. > > I took a quick look at this. I'm quite worried about the potential > performance cost of the v4-0001 patch (the one for fixing > slot_store_error_callback). Previously, we didn't pay any noticeable > cost for having the callback unless there actually was an error. > As patched, we perform several catalog lookups per column per row, > even in the non-error code path. That seems like it'd be a noticeable > performance hit. Just to add insult to injury, it leaks memory. > > I propose a more radical but simpler solution: let's just not bother > with including the type names in the context message. How much are > they really buying?

Thanks. In that case, the message can only report the local and remote column names and ignore the types (something like below), and the user will have to figure out the types of those columns locally and remotely separately in case of an error. The function logicalrep_typmap_gettypname can then also be removed. I'm not sure if this is okay. Thoughts?

    TupleDesc localtupdesc = RelationGetDescr(rel->localrel);
    Form_pg_attribute localatt = TupleDescAttr(localtupdesc, errarg->local_attnum - 1);

    errcontext("processing remote data for replication target relation \"%s.%s\", "
               "remote column %s, local column %s",
               rel->remoterel.nspname, rel->remoterel.relname,
               rel->remoterel.attnames[errarg->remote_attnum],
               NameStr(localatt->attname));

> v4-0002 (for postgres_fdw's conversion_error_callback) has the same > problems, although mitigated a bit by not needing to do any catalog > lookups in the non-join case. For the join case, I wonder whether > we could push the lookups back to plan setup time, so they don't > need to be done over again for each row. (Not doing lookups at all > doesn't seem attractive there; it'd render the context message nearly > useless.) A different idea is to try to get the column names out > of the query's rangetable instead of going to the catalogs.

I will think more on this point. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
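To make the rangetable suggestion a bit more concrete, here is a rough sketch (not the eventual fix; the argument list and message wording are hypothetical, and the real callback would carry this information in its ConversionLocation) of deriving the column name from the query's range table rather than from per-row catalog lookups:

```c
/*
 * Illustration only: meant to be called from postgres_fdw's
 * conversion_error_callback(), so errcontext() is legal here.
 * Assumes postgres_fdw.c's existing headers.
 */
static void
report_conversion_column(List *fdw_scan_tlist, AttrNumber cur_attno,
						 EState *estate)
{
	TargetEntry *tle = list_nth_node(TargetEntry, fdw_scan_tlist,
									 cur_attno - 1);

	if (IsA(tle->expr, Var))
	{
		Var		   *var = (Var *) tle->expr;
		RangeTblEntry *rte = exec_rt_fetch(var->varno, estate);
		char	   *attname = strVal(list_nth(rte->eref->colnames,
											  var->varattno - 1));

		/* names come from the range table, not from syscache lookups */
		errcontext("processing column \"%s\" of relation \"%s\"",
				   attname, rte->eref->aliasname);
	}
	else
		errcontext("processing output column %d of the foreign scan",
				   cur_attno);
}
```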
Re: a verbose option for autovacuum
On Wed, Mar 10, 2021, at 12:46 AM, Masahiko Sawada wrote: > Attached a patch. I've slightly modified the format for consistency > with heap statistics. Since commit 5f8727f5a6, this patch doesn't apply anymore. Fortunately, it is just a small hunk. I reviewed this patch and it looks good to me. There is just a small issue (double space after 'if') that I fixed in the attached version. -- Euler Taveira EDB https://www.enterprisedb.com/ diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 8341879d89..35be34ea06 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -314,6 +314,7 @@ typedef struct LVRelStats int num_index_scans; TransactionId latestRemovedXid; bool lock_waiter_detected; + IndexBulkDeleteResult **indstats; /* used for autovacuum logs */ /* Used for error callback */ char *indname; @@ -531,9 +532,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, /* Do the vacuuming */ lazy_scan_heap(onerel, params, vacrelstats, Irel, nindexes, aggressive); - /* Done with indexes */ - vac_close_indexes(nindexes, Irel, NoLock); - /* * Compute whether we actually scanned the all unfrozen pages. If we did, * we can adjust relfrozenxid and relminmxid. @@ -614,13 +612,18 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, pgstat_progress_end_command(); /* and log the action if appropriate */ - if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0) + if (IsAutoVacuumWorkerProcess()) { - TimestampTz endtime = GetCurrentTimestamp(); + TimestampTz endtime = 0; + int i; - if (params->log_min_duration == 0 || - TimestampDifferenceExceeds(starttime, endtime, - params->log_min_duration)) + if (params->log_min_duration >= 0) + endtime = GetCurrentTimestamp(); + + if (endtime > 0 && + (params->log_min_duration == 0 || + TimestampDifferenceExceeds(starttime, endtime, + params->log_min_duration))) { StringInfoData buf; char *msgfmt; @@ -680,6 +683,21 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, (long long) VacuumPageHit, (long long) VacuumPageMiss, (long long) VacuumPageDirty); + for (i = 0; i < nindexes; i++) + { +IndexBulkDeleteResult *stats = vacrelstats->indstats[i]; + +if (!stats) + continue; + +appendStringInfo(&buf, + _("index \"%s\": pages: %u remain, %u newly deleted, %u currently deleted, %u reusable\n"), + RelationGetRelationName(Irel[i]), + stats->num_pages, + stats->pages_newly_deleted, + stats->pages_deleted, + stats->pages_free); + } appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"), read_rate, write_rate); if (track_io_timing) @@ -704,7 +722,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, (errmsg_internal("%s", buf.data))); pfree(buf.data); } + + /* Cleanup index statistics */ + for (i = 0; i < nindexes; i++) + { + if (vacrelstats->indstats[i]) +pfree(vacrelstats->indstats[i]); + } } + + /* Done with indexes */ + vac_close_indexes(nindexes, Irel, NoLock); } /* @@ -1758,7 +1786,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, /* Update index statistics */ if (vacrelstats->useindex) + { update_index_statistics(Irel, indstats, nindexes); + vacrelstats->indstats = indstats; + } /* If no indexes, make log report that lazy_vacuum_heap would've made */ if (vacuumed_pages) @@ -3243,7 +3274,13 @@ update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats, InvalidTransactionId, InvalidMultiXactId, false); - pfree(stats[i]); + + /* + * Autovacuum worker keeps the index statistics until the end + 
* of lazy vacuum for autovacuum logs. + */ + if (!IsAutoVacuumWorkerProcess()) + pfree(stats[i]); } }
Re: proposal: schema variables - doc
> On 2021.03.13. 07:01 Pavel Stehule wrote: > Hi > fresh rebase > [schema-variables-20210313.patch.gz]

Hi Pavel,

I notice that the phrase 'schema variable' is not in the index at the end ('bookindex.html'). Not good. It is also not in the index at the front of the manual - also not good. Maybe these two (front and back index) can be added? If a user searches the pdf, the first occurrence he finds is at: 43.13.2.4. Global variables and constants (in itself that occurrence/mention is all right, but it should not be the first find, I think) (I think there was in earlier versions of the patch an entry in the 'contents', i.e., at the front of the manual).

I think it would be good to have it in the front index, pointing to either LET or CREATE VARIABLE, or maybe even to a small introductory paragraph somewhere else (again, I seem to remember that there was one in an earlier patch version). Of the new commands that this patch brings, 'LET' is the most immediately illuminating for a user (even when a CREATE VARIABLE has to be done first). There is an entry 'LET' in the index (good), but it would be better if the phrase 'schema variable' also occurred with that LET entry. (I don't know if that's possible)

Then, in the CREATE VARIABLE paragraphs it says 'Changing a schema variable is non-transactional by default.' I think that, unless there exists a mode where schema vars can be made transactional, 'by default' should be deleted (and there is no such 'transactional mode' for schema variables, is there?). The 'Description' also has such a 'By default' which is better removed for the same reason.

In the CREATE VARIABLE page the example is:

CREATE VARIABLE var1 AS integer;
SELECT var1;

I suggest making that:

CREATE VARIABLE var1 AS date;
LET var1 = (select current_date);
SELECT var1;

so that the example immediately shows an application of the functionality.

Thanks, Erik Rijkers

> > Pavel
Re: Assertion failure with barriers in parallel hash join
On Wed, Mar 17, 2021 at 6:58 PM Thomas Munro wrote: > According to BF animal elver there is something wrong with this > commit. Looking into it. Assertion failure reproduced here and understood, but unfortunately it'll take some more time to fix this. I've reverted the commit for now to unbreak the ~5 machines that are currently showing red in the build farm.
Re: Get memory contexts of an arbitrary backend process
On 2021-03-05 14:22, Fujii Masao wrote: On 2021/03/04 18:32, torikoshia wrote: On 2021-01-14 19:11, torikoshia wrote: Since pg_get_target_backend_memory_contexts() waits to dump memory and it could lead dead lock as below. - session1 BEGIN; TRUNCATE t; - session2 BEGIN; TRUNCATE t; -- wait - session1 SELECT * FROM pg_get_target_backend_memory_contexts(); --wait Thanks for notifying me, Fujii-san. Attached v8 patch that prohibited calling the function inside transactions. Regrettably, this modification could not cope with the advisory lock and I haven't come up with a good way to deal with it. It seems to me that the architecture of the requestor waiting for the dumper leads to this problem and complicates things. Considering the discussion printing backtrace discussion[1], it seems reasonable that the requestor just sends a signal and dumper dumps to the log file. +1 Thanks! I remade the patch and introduced a function pg_print_backend_memory_contexts(PID) which prints the memory contexts of the specified PID to elog. =# SELECT pg_print_backend_memory_contexts(450855); ** log output ** 2021-03-17 15:21:01.942 JST [450855] LOG: Printing memory contexts of PID 450855 2021-03-17 15:21:01.942 JST [450855] LOG: level: 0 TopMemoryContext: 68720 total in 5 blocks; 16312 free (15 chunks); 52408 used 2021-03-17 15:21:01.942 JST [450855] LOG: level: 1 Prepared Queries: 65536 total in 4 blocks; 35088 free (14 chunks); 30448 used 2021-03-17 15:21:01.942 JST [450855] LOG: level: 1 pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used ..(snip).. 2021-03-17 15:21:01.942 JST [450855] LOG: level: 2 CachedPlanSource: 4096 total in 3 blocks; 680 free (0 chunks); 3416 used: PREPARE hoge_200 AS SELECT * FROM pgbench_accounts WHERE aid = 1... 2021-03-17 15:21:01.942 JST [450855] LOG: level: 3 CachedPlanQuery: 4096 total in 3 blocks; 464 free (0 chunks); 3632 used ..(snip).. 2021-03-17 15:21:01.945 JST [450855] LOG: level: 1 Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used 2021-03-17 15:21:01.945 JST [450855] LOG: level: 1 ErrorContext: 8192 total in 1 blocks; 7928 free (5 chunks); 264 used 2021-03-17 15:21:01.945 JST [450855] LOG: Grand total: 2802080 bytes in 1399 blocks; 480568 free (178 chunks); 2321512 used As above, the output is almost the same as MemoryContextStatsPrint() except for the way of expression of the level. MemoryContextStatsPrint() uses indents, but pg_print_backend_memory_contexts() writes it as "level: %d". Since there was discussion about enlarging StringInfo may cause errors on OOM[1], this patch calls elog for each context. As with MemoryContextStatsPrint(), each context shows 100 children at most. I once thought it should be configurable, but something like pg_print_backend_memory_contexts(PID, num_children) needs to send the 'num_children' from requestor to dumper and it seems to require another infrastructure. Creating a new GUC for this seems overkill. If MemoryContextStatsPrint(), i.e. showing 100 children at most is enough, this hard limit may be acceptable. Only superusers can call pg_print_backend_memory_contexts(). I'm going to add documentation and regression tests. Any thoughts? 
[1] https://www.postgresql.org/message-id/CAMsr%2BYGh%2Bsso5N6Q%2BFmYHLWC%3DBPCzA%2B5GbhYZSGruj2d0c7Vvg%40mail.gmail.com "r_d/strengthen_perf/print_memcon.md" 110L, 5642C written Regards, -- Atsushi Torikoshi NTT DATA CORPORATIONdiff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index c6a8d4611e..e116f4a1be 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -30,6 +30,7 @@ #include "storage/shmem.h" #include "storage/sinval.h" #include "tcop/tcopprot.h" +#include "utils/memutils.h" /* * The SIGUSR1 signal is multiplexed to support signaling multiple event @@ -440,6 +441,20 @@ HandleProcSignalBarrierInterrupt(void) /* latch will be set by procsignal_sigusr1_handler */ } +/* + * HandleProcSignalPrintMemoryContext + * + * Handle receipt of an interrupt indicating print memory context. + * Signal handler portion of interrupt handling. + */ +static void +HandleProcSignalPrintMemoryContext(void) +{ + InterruptPending = true; + PrintMemoryContextPending = true; + /* latch will be set by procsignal_sigusr1_handler */ +} + /* * Perform global barrier related interrupt checking. * @@ -580,6 +595,25 @@ ProcessProcSignalBarrier(void) ConditionVariableBroadcast(&MyProcSignalSlot->pss_barrierCV); } +/* + * ProcessPrintMemoryContextInterrupt + * The portion of print memory context interrupt handling that runs + * outside of the signal handler. + */ +void +ProcessPrintMemoryContextInterrupt(void) +{ + PrintMemoryContextPending = false; + + ereport(LOG, + (errmsg("Printing memory contexts of PID %d", MyProcPid))); + + /* A har
Re: pgsql: Add libpq pipeline mode support to pgbench
Fabien COELHO wrote: > For consistency with the existing \if … \endif, ISTM that it could have > been named \batch … \endbatch or \pipeline … \endpipeline? "start" mirrors "end". To me, the analogy with \if-\endif is not obvious. Grammatically \if is meant to introduce the expression after it, whereas \startpipeline takes no argument. Functionally \startpipeline can be thought of as "open the valve" and \endpipeline "close the valve". They're "call-to-action" kinds of commands, and in that sense quite different from the \if-\endif pair. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
Re: Calendar support in localization
On Tue, Mar 16, 2021 at 12:20 PM Thomas Munro wrote: > On Wed, Mar 17, 2021 at 6:31 AM Surafel Temesgen > wrote: > > Ethiopice calendar have 13 months so it can not be stored as date and > timestamp type and you approach seems more complicated and i suggest to > have this feature on the purpose of PostgreSQL popularities too not only > for my need > > I know, but the DATE and TIMESTAMPTZ datatypes don't intrinsically > know anything about months or other calendar concepts. Internally, > they are just a single number that counts the number of days or > seconds since an arbitrary epoch time. We are all in agreement about > how many times the Earth has rotated since then*. The calendar > concepts such as "day", "month", "year", whether Gregorian, Ethiopic, > Islamic, ... are all derivable from those numbers, if you know the > rules. > > Okay > > > > I think you suggesting this by expecting the implementation is difficult > but it's not that much difficult once you fully understand Gregorian > calendar and the calendar you work on > > Yeah, I am sure it's all just a bunch of simple integer maths. But > I'm talking about things like software architecture, maintainability, > cohesion, and getting maximum impact for the work we do. > > I may be missing some key detail though: why do you think it should be > a different type? The two reasons I can think of are: (1) the > slightly tricky detail that the date apparently changes at 1:00am > (which I don't think is a show stopper for this approach, I could > elaborate), (2) you may want dates to be formatted on the screen with > the Ethiopic calendar in common software like psql and GUI clients, > which may be easier to arrange with different types, but that seems to > be a cosmetic thing that could eventually be handled with tighter > locale integration with ICU. In the early stages you'd access > calendar logic through special functions with names like > icu_format_date(), or whatever.

As you mention above, whatever the calendar type is, we end up storing an integer that represents the date. So rather than re-implementing every function and operation for every calendar, we can use the existing Gregorian implementation as a base: if a new calendar wants to perform the same function or operation, it translates to the Gregorian one, uses the existing function or operation, and translates back to the working calendar. In this approach, the only functions we need for supporting a new calendar are a translator from the new calendar to the Gregorian one and from the Gregorian calendar to the new one, plus maybe input/output functions. What do you think of this implementation? regards Surafel
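To make the "translator" idea concrete, a sketch in C of such a conversion pair, reusing PostgreSQL's existing Julian-day helpers (date2j()/j2date()). The Ethiopic arithmetic follows the usual published 30-day-month/Pagume formulas, and the epoch constant is an assumption to be verified, not reviewed code:

```c
#include "postgres.h"

#include "utils/datetime.h"		/* date2j() and j2date() */

/*
 * Julian day number of 1 Mäskäräm 1, Amete Mihret era (assumed constant,
 * to be verified against a calendrical reference).
 */
#define ETHIOPIC_EPOCH_JDN	1724221

/* Ethiopic (y, m, d) -> Julian day number; valid for dates after the epoch */
static int
ethiopic2j(int y, int m, int d)
{
	/* 12 months of 30 days plus Pagume; leap day every 4th year */
	return ETHIOPIC_EPOCH_JDN - 1
		+ 365 * (y - 1) + y / 4
		+ 30 * (m - 1) + d;
}

/* Julian day number -> Ethiopic (y, m, d) */
static void
j2ethiopic(int jd, int *y, int *m, int *d)
{
	int			days = jd - ETHIOPIC_EPOCH_JDN;

	*y = (4 * days + 1463) / 1461;
	days -= 365 * (*y - 1) + *y / 4;
	*m = days / 30 + 1;
	*d = days % 30 + 1;
}

/* Round-trips through the existing Gregorian <-> Julian-day helpers */
static void
ethiopic_to_gregorian(int ey, int em, int ed, int *gy, int *gm, int *gd)
{
	j2date(ethiopic2j(ey, em, ed), gy, gm, gd);
}

static void
gregorian_to_ethiopic(int gy, int gm, int gd, int *ey, int *em, int *ed)
{
	j2ethiopic(date2j(gy, gm, gd), ey, em, ed);
}
```

With such a pair, every existing operator keeps working on the stored day number; only parsing, formatting and calendar-aware arithmetic (months, years) would need calendar-specific code.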
Re: Calendar support in localization
On Wed, Mar 17, 2021 at 9:54 AM Surafel Temesgen wrote: > As you mention above whatever the calendar type is we ended up storing an > integer that represent the date so rather than re-implementing every > function and operation for every calendar we can use existing Gerigorian > implementation as a base implementation and if new calendar want to perform > same function or operation it translate to Gregorian one and use the existing > function and operation and translate to back to working calendar. In this > approach the only function we want for supporting a new calendar is a > translator from the new calendar to Gregorian one and from Gerigorian > calendar to the new calendar and may be input/ output function. What do you > think of this implementation? I'm not sure what the best way of tackling this problem is, but I wanted to mention another possible approach: instead of actually using the timestamptz data type, have another data type that is binary-compatible with timestamptz - that is, it's the same thing on disk, so you can cast between the two data types for free. Then have separate input/output functions for it, separate operators and functions and so forth. It's not very obvious how to scale this kind of approach to a wide variety of calendar types, and as Thomas says, it would much cooler to be able to handle all of the ones that ICU knows how to support rather than just one. But, the problem I see with using timestamptz is that it's not so obvious how to get a different output format ... unless, I guess, we can cram it into DateStyle. And it's also much less obvious how you get the other functions and operators to do what you want, if it's different. -- Robert Haas EDB: http://www.enterprisedb.com
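As a concrete illustration of the binary-compatible-type idea, a sketch of an input function for a hypothetical ethiopic_date type that parses the alternative notation but stores an ordinary DateADT (the type name, the deliberately naive parsing, and the ethiopic2j() helper from the earlier sketch are all assumptions, not an existing API):

```c
#include "postgres.h"

#include "datatype/timestamp.h"	/* POSTGRES_EPOCH_JDATE */
#include "fmgr.h"
#include "utils/date.h"

/* hypothetical helper from the earlier conversion sketch */
extern int	ethiopic2j(int y, int m, int d);

/*
 * Input function for a hypothetical "ethiopic_date" type sharing date's
 * on-disk format (days since 2000-01-01); the output function, operators,
 * PG_MODULE_MAGIC and SQL-level declarations are omitted.
 */
PG_FUNCTION_INFO_V1(ethiopic_date_in);

Datum
ethiopic_date_in(PG_FUNCTION_ARGS)
{
	char	   *str = PG_GETARG_CSTRING(0);
	int			y,
				m,
				d;

	/* deliberately naive parsing; real code needs range validation */
	if (sscanf(str, "%d-%d-%d", &y, &m, &d) != 3)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_DATETIME_FORMAT),
				 errmsg("invalid input syntax for type ethiopic_date: \"%s\"",
						str)));

	/* store the same day number an ordinary date would store */
	PG_RETURN_DATEADT(ethiopic2j(y, m, d) - POSTGRES_EPOCH_JDATE);
}
```

At the SQL level, such a type could presumably be created with CREATE TYPE ... (LIKE = date) plus binary-coercible casts, so the stored value can also be read through the ordinary date operators wherever the Gregorian semantics are acceptable.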
Re: Getting better results from valgrind leak tracking
Andres Freund writes: > On 2021-03-16 20:50:17 -0700, Andres Freund wrote: >> What I meant was that I didn't understand how there's not a leak >> danger when compilation fails halfway through, given that the context >> in question is below TopMemoryContext and that I didn't see a relevant >> TRY block. But that probably there is something cleaning it up that I >> didn't see. > Looks like it's an actual leak: Yeah, I believe that. On the other hand, I'm not sure that such cases represent any real problem for production usage. I'm inclined to focus on non-error scenarios first. (Having said that, we probably have the ability to fix such things relatively painlessly now, by reparenting an initially-temporary context once we're done parsing.) Meanwhile, I'm still trying to understand why valgrind is whining about the rd_indexcxt identifier strings. AFAICS it shouldn't. regards, tom lane
Re: Calendar support in localization
Robert Haas writes: > It's not very obvious how to scale this kind of approach to a wide > variety of calendar types, and as Thomas says, it would much cooler to > be able to handle all of the ones that ICU knows how to support rather > than just one. But, the problem I see with using timestamptz is that > it's not so obvious how to get a different output format ... unless, I > guess, we can cram it into DateStyle. And it's also much less obvious > how you get the other functions and operators to do what you want, if > it's different. Yeah, I'm afraid that it probably is different. The most obvious example is in operations involving type interval: select now() + '1 month'::interval; That should almost certainly give a different answer when using a different calendar --- indeed the units of interest might not even be the same. (Do all human calendars use the concept of months?) I don't feel like DateStyle is chartered to affect the behavior of datetime operators; it's understood as tweaking the I/O behavior only. There might be more of a case for letting LC_TIME choose this behavior, but I bet the relevant standards only contemplate Gregorian calendars. Also, the SQL spec says in so many words that the SQL-defined datetime types follow the Gregorian calendar. So on the whole, new datatypes and operators seem like the way to go. I concur that if ICU has solved this problem, piggybacking on it seems more promising than writing our own code. regards, tom lane
RE: libpq debug log
Hi Tsunakawa-san, Alvaro-san,

Thank you very much for your review. It helped me a lot to make the patch better. I updated the patch to v26. This patch has been updated according to Tsunakawa-san's and Alvaro-san's review comments. The output is as follows:

```
2021-03-17 14:43:16.411238 > Terminate 4
2021-03-17 14:43:16.441775 > Query 155 "CREATE TABLESPACE regress_tblspacewith LOCATION '/home/iwata/pgsql/postgres/src/test/regress/testtablespace' WITH (some_nonexistent_parameter = true);"
2021-03-17 14:43:16.442935 < ErrorResponse 124 S "ERROR" V "ERROR" C "22023" M "unrecognized parameter "some_nonexistent_parameter"" F "reloptions.c" L "1447" R "parseRelOptionsInternal" \x00 "Z"
2021-03-17 14:43:16.442977 < ReadyForQuery 5 I
2021-03-17 14:43:16.443042 > Query 144 "CREATE TABLESPACE regress_tblspacewith LOCATION '/home/iwata/pgsql/postgres/src/test/regress/testtablespace' WITH (random_page_cost = 3.0);"
2021-03-17 14:43:16.444774 < CommandComplete 22 "CREATE TABLESPACE"
2021-03-17 14:43:16.444807 < ReadyForQuery 5 I
2021-03-17 14:43:16.444878 > Query 81 "SELECT spcoptions FROM pg_tablespace WHERE spcname = 'regress_tblspacewith';"
2021-03-17 14:43:16.448117 < RowDescription 35 1 "spcoptions" 1213 5 1009 65535 -1 0
2021-03-17 14:43:16.448157 < DataRow 32 1 22 '{random_page_cost=3.0}'
2021-03-17 14:43:16.448171 < CommandComplete 13 "SELECT 1"
```

> -Original Message-
> From: Tsunakawa, Takayuki/綱川 貴之
> Sent: Tuesday, March 16, 2021 12:03 PM

> (1)
> - Enables tracing of the client/server communication to a debugging
> file stream.
> + Enables tracing of the client/server communication to a debugging file
> + stream.
> + Only available when protocol version 3 or higher is used.
>
> The last line is unnecessary now that v2 is not supported.

I removed that sentence from the documentation file.

> (2)
> @@ -113,6 +114,7 @@ install: all installdirs install-lib
> $(INSTALL_DATA) $(srcdir)/libpq-fe.h '$(DESTDIR)$(includedir)'
> $(INSTALL_DATA) $(srcdir)/libpq-events.h '$(DESTDIR)$(includedir)'
> $(INSTALL_DATA) $(srcdir)/libpq-int.h '$(DESTDIR)$(includedir_internal)'
> + $(INSTALL_DATA) $(srcdir)/libpq-trace.h '$(DESTDIR)$(includedir_internal)'
> $(INSTALL_DATA) $(srcdir)/pqexpbuffer.h '$(DESTDIR)$(includedir_internal)'
> $(INSTALL_DATA) $(srcdir)/pg_service.conf.sample '$(DESTDIR)$(datadir)/pg_service.conf.sample'
>
> Why is this necessary?

I think it is not necessary. I removed it.

> (3) libpq-trace.h
> +#ifdef __cplusplus
> +extern "C"
> +{
>
> This is unnecessary because pqTraceOutputMessage() is for libpq's internal
> use. This extern is for the user's C++ application to call public libpq
> functions.
>
> +#include "libpq-fe.h"
> +#include "libpq-int.h"
>
> I think these can be removed by declaring the struct and function as follows:
>
> struct pg_conn;
> extern void pqTraceOutputMessage(struct pg_conn *conn, const char *message, bool toServer);
>
> But... after doing the above, only this declaration is left in libpq-trace.h.
> Why don't you put your original declaration using PGconn in libpq-int.h and remove
> libpq-trace.h?

I removed this file. I had been wondering whether to delete it during the development of the v25 patch. I had kept it for extensibility: if the tracing code is updated later and gains a lot of extern functions, keeping this header file would be meaningful. However, I don't think that will happen, so I removed it.
> (4) > + /* Trace message only when there is first 1 byte */ > + if (conn->Pfdebug && (conn->outCount < conn->outMsgStart)) > + pqTraceOutputMessage(conn, conn->outBuffer + > conn->outCount, true); > > () surrounding the condition after && can be removed. I removed this (). And This if () statement has been modified according to the review comment (14). > (5) > +static const char *pqGetProtocolMsgType(unsigned char c, > + > bool toServer); > > This is unnecessary because the function definition precedes its only one call > site. Yes, I removed this definition. > (6) > + * We accumulate frontend message pieces in an array as the libpq code > + writes > + * them, and log the complete message when pqTraceOutputFeMsg is called. > + * For backend, we print the pieces as soon as we receive them from the > server. > > The first paragraph is no longer true. I think this entire comment is > unnecessary. I removed this explanation. > (7) > +static const char * > +pqGetProtocolMsgType(unsigned char c, bool toServer) { > + if (toServer == true && > + c < lengthof(protocol_message_type_f) && > + protocol_message_type_f[c] != NULL) > + return protocol_message_type_f[c]; > + > + if (toServer == false && > + c < lengthof(protoco
Re: Getting better results from valgrind leak tracking
Hi, On Wed, Mar 17, 2021, at 07:16, Tom Lane wrote: > Andres Freund writes: > > On 2021-03-16 20:50:17 -0700, Andres Freund wrote: > Meanwhile, I'm still trying to understand why valgrind is whining > about the rd_indexcxt identifier strings. AFAICS it shouldn't. I found a way around that late last night. Need to mark the context itself as an allocation. But I made a mess on the way to that and need to clean the patch up before sending it (and need to drop my girlfriend off first). Andres
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Sun, Mar 14, 2021 at 04:06:45PM +0800, Julien Rouhaud wrote: > Recent conflict, thanks to cfbot. v18 attached. We are reaching the two-year mark on this feature, that everyone seems to agree is needed. Is any committer going to work on this to get it into PG 14? Should I take it? I just read the thread and I didn't see any open issues. Are there any? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Bruce Momjian writes: > We are reaching the two-year mark on this feature, that everyone seems > to agree is needed. Is any committer going to work on this to get it > into PG 14? Should I take it? I still say that it's a serious mistake to sanctify a query ID calculation method that was designed only for pg_stat_statement's needs as the one true way to do it. But that's what exposing it in a core view would do. regards, tom lane
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > Bruce Momjian writes: > > We are reaching the two-year mark on this feature, that everyone seems > > to agree is needed. Is any committer going to work on this to get it > > into PG 14? Should I take it? > > I still say that it's a serious mistake to sanctify a query ID calculation > method that was designed only for pg_stat_statement's needs as the one > true way to do it. But that's what exposing it in a core view would do. OK, I am fine with creating a new method, and maybe having pg_stat_statements use it. Is that the direction we should be going in? I do think we need _some_ method in core if we are going to be exposing this value in pg_stat_activity and log_line_prefix. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: PoC/WIP: Extended statistics on expressions
On Sun, 7 Mar 2021 at 21:10, Tomas Vondra wrote: > > 2) ndistinct > > There's one thing that's bugging me, in how we handle "partial" matches. > For each expression we track both the original expression and the Vars > we extract from it. If we can't find a statistics matching the whole > expression, we try to match those individual Vars, and we remove the > matching ones from the list. And in the end we multiply the estimates > for the remaining Vars. > > This works fine with one matching ndistinct statistics. Consider for example > > GROUP BY (a+b), (c+d) > > with statistics on [(a+b),c] - that is, expression and one column. I've just been going over this example, and I think it actually works slightly differently from how you described, though I haven't worked out the full general implications of that. > We parse the expressions into two GroupExprInfo > > {expr: (a+b), vars: [a, b]} > {expr: (c+d), vars: [c, d]} > Here, I think what you actually get, in the presence of stats on [(a+b),c] is actually the following two GroupExprInfos: {expr: (a+b), vars: []} {expr: (c+d), vars: [c, d]} because of the code immediately after this comment in estimate_num_groups(): /* * If examine_variable is able to deduce anything about the GROUP BY * expression, treat it as a single variable even if it's really more * complicated. */ As it happens, that makes no difference in this case, where there is just this one stats object, but it does change things when there are two stats objects. > and the statistics matches the first item exactly (the expression). The > second expression is not in the statistics, but we match "c". So we end > up with an estimate for "(a+b), c" and have one remaining GroupExprInfo: > > {expr: (c+d), vars: [d]} Right. > Without any other statistics we estimate that as ndistinct for "d", so > we end up with > > ndistinct((a+b), c) * ndistinct(d) > > which mostly makes sense. It assumes ndistinct(c+d) is product of the > ndistinct estimates, but that's kinda what we've been always doing. Yes, that appears to be what happens, and it's probably the best that can be done with the available stats. > But now consider we have another statistics on just (c+d). In the second > loop we end up matching this expression exactly, so we end up with > > ndistinct((a+b), c) * ndistinct((c+d)) In this case, with stats on (c+d) as well, the two GroupExprInfos built at the start change to: {expr: (a+b), vars: []} {expr: (c+d), vars: []} so it actually ends up not using any multivariate stats, but instead uses the two univariate expression stats, giving ndistinct((a+b)) * ndistinct((c+d)) which actually seems pretty good, and gave very good estimates in the simple test case I tried. > i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think > what we should do after the first loop is just discarding the whole > expression and "expand" into per-variable GroupExprInfo, so in the > second step we would not match the (c+d) statistics. Not using the (c+d) stats would give either ndistinct((a+b)) * ndistinct(c) * ndistinct(d) or ndistinct((a+b), c) * ndistinct(d) depending on exactly how the algorithm was changed. In my test, these both gave worse estimates, but there are probably other datasets for which they might do better. It all depends on how much correlation there is between (a+b) and c. I suspect that there is no optimal strategy for handling overlapping stats that works best for all datasets, but the current algorithm seems to do a pretty decent job. 
> Of course, maybe there's a better way to pick the statistics, but I > think our conclusion so far was that people should just create > statistics covering all the columns in the query, to not have to match > multiple statistics like this. Yes, I think that's always likely to work better, especially for ndistinct stats, where all possible permutations of subsets of the columns are included, so a single ndistinct stat can work well for a range of different queries. Regards, Dean
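For readers following the thread, here is a minimal SQL sketch of the scenario under discussion, assuming the expression-statistics syntax from this patch series; the table, data, and statistics names are invented purely for illustration:

CREATE TABLE t (a int, b int, c int, d int);
INSERT INTO t SELECT i % 10, i % 7, i % 11, i % 13
  FROM generate_series(1, 100000) AS i;

-- Multivariate ndistinct statistics on an expression plus a plain column,
-- matching the first case discussed above ...
CREATE STATISTICS s1 (ndistinct) ON (a + b), c FROM t;

-- ... and univariate expression statistics on the second expression alone.
CREATE STATISTICS s2 ON (c + d) FROM t;

ANALYZE t;

-- Compare the estimated and actual number of groups.
EXPLAIN ANALYZE SELECT (a + b), (c + d) FROM t GROUP BY (a + b), (c + d);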
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Bruce Momjian writes: > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: >> I still say that it's a serious mistake to sanctify a query ID calculation >> method that was designed only for pg_stat_statement's needs as the one >> true way to do it. But that's what exposing it in a core view would do. > OK, I am fine with creating a new method, and maybe having > pg_stat_statements use it. Is that the direction we should be going in? The point is that we've understood Query.queryId as something that different extensions might calculate differently for their own needs. In particular it's easy to imagine extensions that want an ID that is less fuzzy than what pg_stat_statements wants. We never had a plan for how two such extensions could co-exist, but at least it was possible to use one if you didn't use another. If this gets moved into core then there will basically be only one way that anyone can do it. Maybe what we need is a design for allowing more than one query ID. > I do think we need _some_ method in core if we are going to be exposing > this value in pg_stat_activity and log_line_prefix. I'm basically objecting to the conclusion that we should do either of those. There is no way around the fact that it will break every user of Query.queryId other than pg_stat_statements, unless they are okay with whatever definition pg_stat_statements is using (which is a moving target BTW). regards, tom lane
Re: pl/pgsql feature request: shorthand for argument and local variable references
st 17. 3. 2021 v 9:20 odesílatel Michael Paquier napsal: > On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote: > > I also think that there should be a single usable top label, otherwise > it will > > lead to confusing code and it can be a source of bug. > > Okay, that's fine by me. Could it be possible to come up with an > approach that does not hijack the namespace list contruction and the > lookup logic as much as it does now? I get it that the patch is done > this way because of the ordering of operations done for the initial ns > list creation and the grammar parsing that adds the routine label on > top of the root one, but I'd like to believe that there are more solid > ways to achieve that, like for example something that decouples the > root label and its alias, or something that associates an alias > directly with its PLpgSQL_nsitem entry? > I am checking it now, and I don't see any easy solution. The namespace is a one direction tree - only backward iteration from leafs to roots is supported. At the moment, when I try to replace the name of root ns_item, this root ns_item is referenced by the function's arguments (NS_TYPE_VAR type). So anytime I need some helper ns_item node, that can be a persistent target for these nodes. In this case is a memory overhead of just one ns_item. orig_label <- argument1 <- argument2 The patch has to save the original orig_label because it can be referenced from argument1 or by "FOUND" variable. New graphs looks like new_label <- invalidated orig_label <- argument1 <- ... This tree has a different direction than is usual, and then replacing the root node is not simple. Second solution (significantly more simple) is an additional pointer in ns_item structure. In almost all cases this pointer will not be used. Because ns_item uses a flexible array, then union cannot be used. I implemented this in a patch marked as "alias-implementation". What do you think about it? Pavel > -- > Michael > diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml index 9242c54329..ed8e774899 100644 --- a/doc/src/sgml/plpgsql.sgml +++ b/doc/src/sgml/plpgsql.sgml @@ -292,7 +292,9 @@ $$ LANGUAGE plpgsql; special variables such as FOUND (see ). The outer block is labeled with the function's name, meaning that parameters and special - variables can be qualified with the function's name. + variables can be qualified with the function's name. The name of this outer + block can be changed by inserting special command + #routine_label new_name at the start of the function. @@ -435,6 +437,31 @@ $$ LANGUAGE plpgsql; + + The function's argument can be qualified with the function name: + +CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$ +BEGIN +RETURN sales_tax.subtotal * 0.06; +END; +$$ LANGUAGE plpgsql; + + + + + Sometimes the function name is too long and it can be more practical to use + some shorter label. 
The top namespace label can be changed (along with + the functions' arguments) using the option routine_label: + +CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$ +#routine_label s +BEGIN +RETURN s.subtotal * 0.06; +END; +$$ LANGUAGE plpgsql; + + + Some more examples: diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c index ce8d97447d..d32e050c32 100644 --- a/src/pl/plpgsql/src/pl_comp.c +++ b/src/pl/plpgsql/src/pl_comp.c @@ -378,6 +378,10 @@ do_compile(FunctionCallInfo fcinfo, */ plpgsql_ns_init(); plpgsql_ns_push(NameStr(procStruct->proname), PLPGSQL_LABEL_BLOCK); + + /* save top ns for possibility to alter top label */ + function->root_ns = plpgsql_ns_top(); + plpgsql_DumpExecTree = false; plpgsql_start_datums(); diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c index 919b968826..7132da35d1 100644 --- a/src/pl/plpgsql/src/pl_funcs.c +++ b/src/pl/plpgsql/src/pl_funcs.c @@ -101,6 +101,7 @@ plpgsql_ns_additem(PLpgSQL_nsitem_type itemtype, int itemno, const char *name) nse->itemtype = itemtype; nse->itemno = itemno; nse->prev = ns_top; + nse->alias = NULL; strcpy(nse->name, name); ns_top = nse; } @@ -141,7 +142,7 @@ plpgsql_ns_lookup(PLpgSQL_nsitem *ns_cur, bool localmode, nsitem->itemtype != PLPGSQL_NSTYPE_LABEL; nsitem = nsitem->prev) { - if (strcmp(nsitem->name, name1) == 0) + if (strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name1) == 0) { if (name2 == NULL || nsitem->itemtype != PLPGSQL_NSTYPE_VAR) @@ -155,13 +156,13 @@ plpgsql_ns_lookup(PLpgSQL_nsitem *ns_cur, bool localmode, /* Check this level for qualified match to variable name */ if (name2 != NULL && - strcmp(nsitem->name, name1) == 0) + strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name1) == 0) { for (nsitem = ns_cur; nsitem->itemtype != PLPGSQL_NSTYPE_LABEL; nsitem = nsitem->prev) { -if (st
Re: WIP: WAL prefetch (another approach)
On 2/15/21 12:18 AM, Stephen Frost wrote: > Greetings, > > ... > I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately. >>> >>> This seems like a great optimization, albeit a fair bit of code, for a >>> relatively uncommon use-case, specifically where full page writes are >>> disabled or very large checkpoints. As that's the case though, I would >>> think it's reasonable to ask that it go out of its way to avoid slowing >>> down the more common configurations, particularly since it's proposed to >>> have it on by default (which I agree with, provided it ends up improving >>> the common cases, which I think the suggestions above would certainly >>> make it more likely to do). >> >> I'm OK to do some benchmarking, but it's not quite clear to me why does it >> matter if the checkpoints are smaller than shared buffers? IMO what matters >> is how "localized" the updates are, i.e. how likely it is to hit the same >> page repeatedly (in a short amount of time). Regular pgbench is not very >> suitable for that, but some non-uniform distribution should do the trick, I >> think. > > I suppose strictly speaking it'd be > Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that > it's more about the wal_decode_buffer_size than the checkpoint's size. > Apologies for the confusion. As suggested above, one way to benchmark > this to really see if there's any issue would be to increase > wal_decode_buffer_size to some pretty big size and then compare the > performance vs. unpatched. I'd think that could even be done with > pgbench, so you're not having to arrange for the same pages to get > updated over and over. > What exactly would be the point of such benchmark? I don't think the patch does prefetching based on wal_decode_buffer_size, that just says how far ahead we decode - the prefetch distance I is defined by maintenance_io_concurrency. But it's not clear to me what exactly would the result say about the necessity of the optimization at hand (skipping prefetches for blocks with recent FPI). If the the maintenance_io_concurrency is very high, the probability that a block is evicted prematurely grows, making the prefetch useless in general. How does this say anything about the problem at hand? Sure, we'll do unnecessary I/O, causing issues, but that's a bit like complaining the engine gets very hot when driving on a highway in reverse. AFAICS to measure the worst case, you'd need a workload with a lot of FPIs, and very little actual I/O. That means, data set that fits into memory (either shared buffers or RAM), and short checkpoints. But that's exactly the case where you don't need prefetching ... >>> Perhaps this already improves the common cases and is worth the extra >>> code on that basis, but I don't recall seeing much in the way of >>> benchmarking in this thread for that case- that is, where FPIs are >>> enabled and checkpoints are smaller than shared buffers. Jakub's >>> testing was done with FPWs disabled and Tomas's testing used checkpoints >>> which were much larger than the size of shared buffers on the system >>> doing the replay. While it's certainly good that this patch improves >>> those cases, we should also be looking out for the worst case and make >>> sure that the patch doesn't degrade performance in that case. >> >> I'm with Andres on this. 
It's fine to leave some possible optimizations on >> the table for the future. And even if some workloads are affected >> negatively, it's still possible to disable the prefetching. > > While I'm generally in favor of this argument, that a feature is > particularly important and that it's worth slowing down the common cases > to enable it, I dislike that it's applied inconsistently. I'd certainly If you have a workload where this happens to cause issues, you can just disable that. IMHO that's a perfectly reasonable engineering approach, where we get something that significantly improves 80% of the cases, allow disabling it for cases where it might cause issues, and then improve it in the next version. > feel better about it if we had actual performance numbers to consider. > I don't doubt the possibility that the extra prefetch's just don't > amount to enough to matter but I have a hard time seeing them as not > having some cost and without actually measuring it, it's hard to say > what that cost is. > > Without looking farther back than the last record, we could end up > repeatedly asking for the same blocks to be prefetched too- > > FPI for block 1 > FPI for block 2 > WAL record for block 1 > WAL record for block 2 > WAL record for block 1 > WAL record for block 2 > WAL record for block 1 > WAL record for block 2 > > ... etc. > > Entirely possible my math is off, but seems like the worst case > situation right now might end up with some 4500 u
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 12:01:38PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > >> I still say that it's a serious mistake to sanctify a query ID calculation > >> method that was designed only for pg_stat_statement's needs as the one > >> true way to do it. But that's what exposing it in a core view would do. > > > OK, I am fine with creating a new method, and maybe having > > pg_stat_statements use it. Is that the direction we should be going in? > > The point is that we've understood Query.queryId as something that > different extensions might calculate differently for their own needs. > In particular it's easy to imagine extensions that want an ID that is > less fuzzy than what pg_stat_statements wants. We never had a plan for > how two such extensions could co-exist, but at least it was possible > to use one if you didn't use another. If this gets moved into core > then there will basically be only one way that anyone can do it. Well, the patch docs say: Enables or disables in core query identifier computation. The extension requires a query --> identifier to be computed. Note that an external module can --> alternatively be used if the in core query identifier computation specification doesn't suit your need. In this case, in core computation must be disabled. The default is off. > Maybe what we need is a design for allowing more than one query ID. > > > I do think we need _some_ method in core if we are going to be exposing > > this value in pg_stat_activity and log_line_prefix. > > I'm basically objecting to the conclusion that we should do either > of those. There is no way around the fact that it will break every > user of Query.queryId other than pg_stat_statements, unless they > are okay with whatever definition pg_stat_statements is using (which > is a moving target BTW). I thought the above doc patch feature avoided this problem because an extension can override the built-in query id. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
st 17. 3. 2021 v 17:03 odesílatel Tom Lane napsal: > Bruce Momjian writes: > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > >> I still say that it's a serious mistake to sanctify a query ID > calculation > >> method that was designed only for pg_stat_statement's needs as the one > >> true way to do it. But that's what exposing it in a core view would do. > > > OK, I am fine with creating a new method, and maybe having > > pg_stat_statements use it. Is that the direction we should be going in? > > The point is that we've understood Query.queryId as something that > different extensions might calculate differently for their own needs. > In particular it's easy to imagine extensions that want an ID that is > less fuzzy than what pg_stat_statements wants. We never had a plan for > how two such extensions could co-exist, but at least it was possible > to use one if you didn't use another. If this gets moved into core > then there will basically be only one way that anyone can do it. > > Maybe what we need is a design for allowing more than one query ID. > Theoretically there can be a hook for calculation of queryid, that can be by used extension. Default can be assigned with a method that is used by pg_stat_statements. I don't think it is possible to use more different query id for pg_stat_statements so this solution can be simple. regards Pavel > > > I do think we need _some_ method in core if we are going to be exposing > > this value in pg_stat_activity and log_line_prefix. > > I'm basically objecting to the conclusion that we should do either > of those. There is no way around the fact that it will break every > user of Query.queryId other than pg_stat_statements, unless they > are okay with whatever definition pg_stat_statements is using (which > is a moving target BTW). > > regards, tom lane > > >
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 05:16:50PM +0100, Pavel Stehule wrote: > > > st 17. 3. 2021 v 17:03 odesílatel Tom Lane napsal: > > Bruce Momjian writes: > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > >> I still say that it's a serious mistake to sanctify a query ID > calculation > >> method that was designed only for pg_stat_statement's needs as the one > >> true way to do it. But that's what exposing it in a core view would > do. > > > OK, I am fine with creating a new method, and maybe having > > pg_stat_statements use it. Is that the direction we should be going in? > > The point is that we've understood Query.queryId as something that > different extensions might calculate differently for their own needs. > In particular it's easy to imagine extensions that want an ID that is > less fuzzy than what pg_stat_statements wants. We never had a plan for > how two such extensions could co-exist, but at least it was possible > to use one if you didn't use another. If this gets moved into core > then there will basically be only one way that anyone can do it. > > Maybe what we need is a design for allowing more than one query ID. > > > Theoretically there can be a hook for calculation of queryid, that can be by > used extension. Default can be assigned with a method that is used by > pg_stat_statements. Yes, that is what the code patch says it does. > I don't think it is possible to use more different query id for > pg_stat_statements so this solution can be simple. Agreed. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: libpq debug log
Hello In pqTraceOutputString(), you can use the return value from fprintf to move the cursor -- no need to count chars. I still think that the message-type specific functions should print the message type, rather than having the string arrays. -- Álvaro Herrera Valdivia, Chile "La gente vulgar sólo piensa en pasar el tiempo; el que tiene talento, en aprovecharlo"
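For context, the suggestion here is simply to rely on fprintf() returning the number of characters it wrote. A rough, hypothetical sketch (not the actual pqTraceOutputString() signature, just the shape of the idea):

#include <stdio.h>

/*
 * Hypothetical helper: print a quoted string and report how far the
 * logical cursor advanced, using fprintf's return value instead of
 * counting characters by hand.
 */
static int
trace_output_string(FILE *f, const char *s)
{
	int written = fprintf(f, "\"%s\"", s);

	/* fprintf returns the number of characters written, or a negative value on error */
	return (written < 0) ? 0 : written;
}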
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 12:24:44PM -0400, Bruce Momjian wrote: > On Wed, Mar 17, 2021 at 05:16:50PM +0100, Pavel Stehule wrote: > > > > > > st 17. 3. 2021 v 17:03 odesílatel Tom Lane napsal: > > > > Bruce Momjian writes: > > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > > >> I still say that it's a serious mistake to sanctify a query ID > > calculation > > >> method that was designed only for pg_stat_statement's needs as the > > one > > >> true way to do it. But that's what exposing it in a core view would > > do. > > > > > OK, I am fine with creating a new method, and maybe having > > > pg_stat_statements use it. Is that the direction we should be going > > in? > > > > The point is that we've understood Query.queryId as something that > > different extensions might calculate differently for their own needs. > > In particular it's easy to imagine extensions that want an ID that is > > less fuzzy than what pg_stat_statements wants. We never had a plan for > > how two such extensions could co-exist, but at least it was possible > > to use one if you didn't use another. If this gets moved into core > > then there will basically be only one way that anyone can do it. > > > > Maybe what we need is a design for allowing more than one query ID. > > > > > > Theoretically there can be a hook for calculation of queryid, that can be by > > used extension. Default can be assigned with a method that is used by > > pg_stat_statements. > > Yes, that is what the code patch says it does. > > > I don't think it is possible to use more different query id for > > pg_stat_statements so this solution can be simple. > > Agreed. Actually, putting the query identifier computation in the core makes it way more tunable, even if it's counterintuitive. What it means is that you can now choose to use the usual pgss algorithm or a different one for log_line_prefix and pg_stat_activity.queryid, but also that you can now use pgss with a different query id algorithm. That's another thing users have been requesting for a long time. I originally suggested making it clearer by having an enum GUC rather than a boolean, say compute_queryid = [ none | core | external ], and if set to external then a hook would be explicitly called. Right now, "none" and "external" are both tied to compute_queryid = off, and which one you get depends on whether an extension is computing a queryid during post_parse_analyse_hook. It could later be extended to suit other needs if we ever come to some agreement (for instance "legacy", "logical_replication_stable" or whatever better name we can find for something that doesn't depend on Oid).
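As a rough illustration of how such a hook-based override can work today, here is a minimal, hypothetical extension sketch. It assumes the existing two-argument post_parse_analyze_hook, and the text-hash "fingerprint" is a toy stand-in for a real jumbling scheme such as the one pg_stat_statements uses:

#include "postgres.h"
#include "fmgr.h"
#include "parser/analyze.h"
#include "common/hashfn.h"

PG_MODULE_MAGIC;

static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

/*
 * Assign a custom query identifier after parse analysis.  Hashing the query
 * text is purely illustrative; a serious implementation would walk (jumble)
 * the Query tree instead.
 */
static void
my_post_parse_analyze(ParseState *pstate, Query *query)
{
	if (prev_post_parse_analyze_hook)
		prev_post_parse_analyze_hook(pstate, query);

	if (query->queryId == UINT64CONST(0) && pstate->p_sourcetext != NULL)
		query->queryId = hash_bytes_extended((const unsigned char *) pstate->p_sourcetext,
											 strlen(pstate->p_sourcetext), 0);
}

void
_PG_init(void)
{
	prev_post_parse_analyze_hook = post_parse_analyze_hook;
	post_parse_analyze_hook = my_post_parse_analyze;
}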
Re: Index Skip Scan (new UniqueKeys)
> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: > Hi, > > I took a look at the new patch series, focusing mostly on the uniquekeys > part. It'd be a bit tedious to explain all the review comments here, so > attached is a patch series with a "review" patch for some of the parts. Great, thanks. > Most of it is fairly small (corrections to comments etc.), I'll go over > the more serious part so that we can discuss it here. I'll keep it split > per parts of the original patch series. > I suggest looking for XXX and FIXME comments in all the review patches. > > > 0001 > > > > > 0002 > > In fact both 0001 & 0002 belong to another thread, which these days span [1], [2]. I've included them only because they happened to be a dependency for index skip scan following David suggestions, sorry if it's confusing. At the same time the author behind 0001 & 0002 is present in this thread as well, maybe Andy can answer these comments right here and better than me. > 0003 > > > Just some comments/whitespace. > > > 0004 > > > I wonder why we don't include this in explain TEXT format? Seems it > might make it harder to write regression tests for this? It's easier to > just check that we deduced the right unique key(s) than having to > construct an example where it actually changes the plan. Yeah, good point. I believe originally it was like that to not make explain too verbose for skip scans, but displaying prefix definitely could be helpful for testing, so will do this (and address other comments as well). [1]: https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com [2]: https://www.postgresql.org/message-id/flat/caku4awru35c9g3ce15jmvwh6b2hzf4hf7czukrsiktv7akr...@mail.gmail.com
pg_dump new feature: exporting functions only. Bad or good idea ?
Hey hackers, I had this idea, which I raised and cherished like my baby, to add a switch in `pg_dump` to allow exporting stored functions (and procedures) only. However, when I finally got the time to look at it in detail, I found out there was no way to resolve the dependencies of the functions and procedures, so the exported file, when replayed, could lead to invalid objects. So, I decided this would not make Postgres better and decided to walk away from this patch. Anyhow, when sharing my thoughts, several people told me to ask the community about adding this feature because it could be useful in some use cases. Another argument is that, should you have all your functions in one schema and your tables in another, exporting only the function schema already leads to the same kind of file, one that could create invalid objects when it is replayed against a database that does not have the tables. Of course, the documentation would add a warning about object invalidity should only the stored functions/procedures be exported. So, my question is: what do you think about such a feature? Is it worth it? Have a nice day, Lætitia
Re: pg_dump new feature: exporting functions only. Bad or good idea ?
On 3/17/21 6:00 PM, Lætitia Avrot wrote: > Hey hackers, > > I had this idea, that I raised and cherished like my baby to add a switch > in `pg_dump` to allow exporting stored functions (and procedures) only. > > However, when I finally got the time to look at it in detail, I found out > there was no way to solve the dependencies in the functions and procedures, > so that the exported file, when re-played could lead to invalid objects. > > So, I decided this would not make Postgres better and decide to walk off > this patch. > > Anyhow, when sharing my thoughts, several people told me to ask the > community about adding this feature because this could be useful in some > use cases. Another argument is that should you have all your functions in > one schema and your tables in another, exporting only the function schema > will lead to the same kind of file that could lead to invalid objects > created when the file is re-played against a database that does not have > the tables. > > Of course, the documentation would add a warning against object invalidity > should only the stored functions/procedures be exported. > > So, my question is: what do you think about such a feature? is it worth it? Yes, it is absolutely worth it to be able to extract just the functions from a database. I have wanted this several times. -- Vik Fearing
Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs
I wrote: > I took a stab at doing that, just to see what it might look like. > I thought it comes out pretty well, really -- see what you think. Hearing nothing further, I pushed that after another round of copy-editing. There's still plenty of time to revise it if anybody has further comments. regards, tom lane
Re: pg_dump new feature: exporting functions only. Bad or good idea ?
On Wed, Mar 17, 2021 at 2:00 PM Lætitia Avrot wrote: > > Hey hackers, > > I had this idea, that I raised and cherished like my baby to add a switch in `pg_dump` to allow exporting stored functions (and procedures) only. > > However, when I finally got the time to look at it in detail, I found out there was no way to solve the dependencies in the functions and procedures, so that the exported file, when re-played could lead to invalid objects. > > So, I decided this would not make Postgres better and decide to walk off this patch. > > Anyhow, when sharing my thoughts, several people told me to ask the community about adding this feature because this could be useful in some use cases. Another argument is that should you have all your functions in one schema and your tables in another, exporting only the function schema will lead to the same kind of file that could lead to invalid objects created when the file is re-played against a database that does not have the tables. > > Of course, the documentation would add a warning against object invalidity should only the stored functions/procedures be exported. > > So, my question is: what do you think about such a feature? is it worth it? > Make total sense since we already have --function=NAME(args) on pg_restore and it doesn't solve dependencies... so we can add it to also export function/procedure contents. +1 for general idea. -- Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
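For reference, a rough sketch of the existing workaround being alluded to, using pg_restore's --function switch against a custom-format dump (the archive name and function signature below are invented for illustration):

# Dump the whole database in custom format.
pg_dump -Fc -f mydb.dump mydb

# List the archive's contents and look at the function entries.
pg_restore -l mydb.dump | grep -i ' FUNCTION '

# Emit the SQL for one named function from the archive.
pg_restore --function='sales_tax(real)' -f sales_tax.sql mydb.dump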
Re: pg_dump new feature: exporting functions only. Bad or good idea ?
Vik Fearing writes: > On 3/17/21 6:00 PM, Lætitia Avrot wrote: >> However, when I finally got the time to look at it in detail, I found out >> there was no way to solve the dependencies in the functions and procedures, >> so that the exported file, when re-played could lead to invalid objects. >> ... >> So, my question is: what do you think about such a feature? is it worth it? > Yes, it is absolutely worth it to be able to extract just the functions > from a database. I have wanted this several times. Selective dumps always have a risk of not being restorable on their own; I don't see that "functions only" is noticeably less safe than "just these tables", or other cases that we support already. What I'm wondering about is how this might interact with the discussion at [1]. regards, tom lane [1] https://www.postgresql.org/message-id/flat/CAFj8pRB10wvW0CC9Xq=1XDs=zcqxer3cblcnza+qix4cuh-...@mail.gmail.com
Re: PoC/WIP: Extended statistics on expressions
Hi, On 3/17/21 4:55 PM, Dean Rasheed wrote: > On Sun, 7 Mar 2021 at 21:10, Tomas Vondra > wrote: >> >> 2) ndistinct >> >> There's one thing that's bugging me, in how we handle "partial" matches. >> For each expression we track both the original expression and the Vars >> we extract from it. If we can't find a statistics matching the whole >> expression, we try to match those individual Vars, and we remove the >> matching ones from the list. And in the end we multiply the estimates >> for the remaining Vars. >> >> This works fine with one matching ndistinct statistics. Consider for example >> >> GROUP BY (a+b), (c+d) >> >> with statistics on [(a+b),c] - that is, expression and one column. > > I've just been going over this example, and I think it actually works > slightly differently from how you described, though I haven't worked > out the full general implications of that. > >> We parse the expressions into two GroupExprInfo >> >> {expr: (a+b), vars: [a, b]} >> {expr: (c+d), vars: [c, d]} >> > > Here, I think what you actually get, in the presence of stats on > [(a+b),c] is actually the following two GroupExprInfos: > > {expr: (a+b), vars: []} > {expr: (c+d), vars: [c, d]} > Yeah, right. To be precise, we get {expr: (a+b), vars: [(a+b)]} because in the first case we pass NIL, so add_unique_group_expr treats the whole expression as a var (a bit strange, but OK). > because of the code immediately after this comment in estimate_num_groups(): > > /* > * If examine_variable is able to deduce anything about the GROUP BY > * expression, treat it as a single variable even if it's really more > * complicated. > */ > > As it happens, that makes no difference in this case, where there is > just this one stats object, but it does change things when there are > two stats objects. > >> and the statistics matches the first item exactly (the expression). The >> second expression is not in the statistics, but we match "c". So we end >> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo: >> >> {expr: (c+d), vars: [d]} > > Right. > >> Without any other statistics we estimate that as ndistinct for "d", so >> we end up with >> >> ndistinct((a+b), c) * ndistinct(d) >> >> which mostly makes sense. It assumes ndistinct(c+d) is product of the >> ndistinct estimates, but that's kinda what we've been always doing. > > Yes, that appears to be what happens, and it's probably the best that > can be done with the available stats. > >> But now consider we have another statistics on just (c+d). In the second >> loop we end up matching this expression exactly, so we end up with >> >> ndistinct((a+b), c) * ndistinct((c+d)) > > In this case, with stats on (c+d) as well, the two GroupExprInfos > built at the start change to: > > {expr: (a+b), vars: []} > {expr: (c+d), vars: []} > > so it actually ends up not using any multivariate stats, but instead > uses the two univariate expression stats, giving > > ndistinct((a+b)) * ndistinct((c+d)) > > which actually seems pretty good, and gave very good estimates in the > simple test case I tried. > Yeah, that works pretty well in this case. I wonder if we'd be better off extracting the Vars and doing what I mistakenly described as the current behavior. That's essentially mean extracting the Vars even in the case where we now pass NIL. 
My concern is that the current behavior (where we prefer expression stats over multi-column stats to some extent) works fine as long as the parts are independent, but once there's dependency it's probably more likely to produce underestimates. I think underestimates for grouping estimates were a risk in the past, so let's not make that worse. >> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think >> what we should do after the first loop is just discarding the whole >> expression and "expand" into per-variable GroupExprInfo, so in the >> second step we would not match the (c+d) statistics. > > Not using the (c+d) stats would give either > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > or > > ndistinct((a+b), c) * ndistinct(d) > > depending on exactly how the algorithm was changed. In my test, these > both gave worse estimates, but there are probably other datasets for > which they might do better. It all depends on how much correlation > there is between (a+b) and c. > > I suspect that there is no optimal strategy for handling overlapping > stats that works best for all datasets, but the current algorithm > seems to do a pretty decent job. > >> Of course, maybe there's a better way to pick the statistics, but I >> think our conclusion so far was that people should just create >> statistics covering all the columns in the query, to not have to match >> multiple statistics like this. > > Yes, I think that's always likely to work better, especially for > ndistinct stats, where
Re: [HACKERS] Custom compression methods
On Wed, Mar 17, 2021 at 7:41 AM Dilip Kumar wrote: > 0002: > - Wrapper over heap_form_tuple and used in ExecEvalRow() and > ExecEvalFieldStoreForm() Instead of having heap_form_flattened_tuple(), how about heap_flatten_values(tupleDesc, values, isnull) that is documented to modify the values array? Then instead of replacing the heap_form_tuple() calls with a call to heap_form_flattened_tuple(), you just insert a call to heap_flatten_values() before the call to heap_form_tuple(). I think that might be easier for people looking at this code in the future to understand what's happening. -- Robert Haas EDB: http://www.enterprisedb.com
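To make the API shape being discussed concrete, here is a hypothetical sketch of the helper Robert describes. The name, and the use of plain detoasting as the flattening step, are illustrative only; the exact condition for what needs flattening is what the patch itself defines:

#include "postgres.h"
#include "access/detoast.h"
#include "access/htup_details.h"

/*
 * Hypothetical helper that scribbles on the caller's values[] array,
 * replacing any varlena datum that still carries a TOAST reference with a
 * detoasted copy, so that a subsequent heap_form_tuple() stores plain values.
 */
static void
heap_flatten_values(TupleDesc tupleDesc, Datum *values, bool *isnull)
{
	for (int i = 0; i < tupleDesc->natts; i++)
	{
		Form_pg_attribute att = TupleDescAttr(tupleDesc, i);

		if (isnull[i] || att->attlen != -1)
			continue;			/* only varlena attributes can be toasted */

		if (VARATT_IS_EXTERNAL(DatumGetPointer(values[i])))
			values[i] = PointerGetDatum(detoast_attr((struct varlena *) DatumGetPointer(values[i])));
	}
}

/*
 * Call pattern under this proposal:
 *     heap_flatten_values(tupdesc, values, isnull);
 *     tuple = heap_form_tuple(tupdesc, values, isnull);
 * versus the patch's single heap_form_flattened_tuple(tupdesc, values, isnull).
 */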
Re: Index Skip Scan (new UniqueKeys)
On 3/17/21 6:02 PM, Dmitry Dolgov wrote: >> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: >> Hi, >> >> I took a look at the new patch series, focusing mostly on the uniquekeys >> part. It'd be a bit tedious to explain all the review comments here, so >> attached is a patch series with a "review" patch for some of the parts. > > Great, thanks. > >> Most of it is fairly small (corrections to comments etc.), I'll go over >> the more serious part so that we can discuss it here. I'll keep it split >> per parts of the original patch series. >> I suggest looking for XXX and FIXME comments in all the review patches. >> >> >> 0001 >> >> >> >> >> 0002 >> >> > > In fact both 0001 & 0002 belong to another thread, which these days > span [1], [2]. I've included them only because they happened to be a > dependency for index skip scan following David suggestions, sorry if > it's confusing. > > At the same time the author behind 0001 & 0002 is present in this thread > as well, maybe Andy can answer these comments right here and better than me. > Ah, sorry for the confusion. In that case the review comments probably belong to the other threads, so we should move the discussion there. It's not clear to me which of the threads is the right one. >> 0003 >> >> >> Just some comments/whitespace. >> >> >> 0004 >> >> >> I wonder why we don't include this in explain TEXT format? Seems it >> might make it harder to write regression tests for this? It's easier to >> just check that we deduced the right unique key(s) than having to >> construct an example where it actually changes the plan. > > Yeah, good point. I believe originally it was like that to not make > explain too verbose for skip scans, but displaying prefix definitely > could be helpful for testing, so will do this (and address other > comments as well). > Cool. Thanks. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
On 2021-Mar-15, Alvaro Herrera wrote: > Here's a fixup patch to do it that way. I tried running the commands > you showed and one of them immediately dies with the new error message; > I can't cause the bogus constraint to show up anymore. Actually, that was a silly fix that didn't actually work correctly, as I discovered immediately after sending it. The right fix is to forbid all commands other than DETACH PARTITION FINALIZE in a partition that's in the process of being detached. In the attached v8, I did that; I also added a ton more tests that hopefully show how the feature should work in concurrent cases, including one case in which the transaction doing the detach is cancelled. I also renamed "inhdetached" to "inhdetachpending", per previous discussion, including changing how to looks in psql. I am not aware of any other loose end in this patch; I consider this version final. Barring further problem reports, I'll get this pushed tomorrow morning. psql completion is missing. If somebody would like to contribute that, I'm grateful. -- Álvaro Herrera39°49'30"S 73°17'W "If you have nothing to say, maybe you need just the right tool to help you not say it." (New York Times, about Microsoft PowerPoint) >From 101a20d1ccf86a255af166e198c7ac691af58c99 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 13 Jul 2020 20:15:30 -0400 Subject: [PATCH v8 1/2] Let ALTER TABLE exec routines deal with the relation This means that ATExecCmd relies on AlteredRelationInfo->rel instead of keeping the relation as a local variable; this is useful if the subcommand needs to modify the relation internally. For example, a subcommand that internally commits its transaction needs this. --- src/backend/commands/tablecmds.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index ffb1308a0c..8e753f1efd 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -156,6 +156,8 @@ typedef struct AlteredTableInfo Oid relid; /* Relation to work on */ char relkind; /* Its relkind */ TupleDesc oldDesc; /* Pre-modification tuple descriptor */ + /* Transiently set during Phase 2, normally set to NULL */ + Relation rel; /* Information saved by Phase 1 for Phase 2: */ List *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */ /* Information saved by Phases 1/2 for Phase 3: */ @@ -353,7 +355,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd, AlterTableUtilityContext *context); static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, AlterTableUtilityContext *context); -static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, +static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass, AlterTableUtilityContext *context); static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, @@ -4512,7 +4514,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, { AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab); List *subcmds = tab->subcmds[pass]; - Relation rel; ListCell *lcmd; if (subcmds == NIL) @@ -4521,10 +4522,10 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, /* * Appropriate lock was obtained by phase 1, needn't get it again */ - rel = relation_open(tab->relid, NoLock); + tab->rel = relation_open(tab->relid, NoLock); foreach(lcmd, subcmds) -ATExecCmd(wqueue, tab, rel, +ATExecCmd(wqueue, tab, castNode(AlterTableCmd, lfirst(lcmd)), lockmode, pass, context); @@ -4536,7 +4537,11 
@@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, if (pass == AT_PASS_ALTER_TYPE) ATPostAlterTypeCleanup(wqueue, tab, lockmode); - relation_close(rel, NoLock); + if (tab->rel) + { +relation_close(tab->rel, NoLock); +tab->rel = NULL; + } } } @@ -4562,11 +4567,12 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, * ATExecCmd: dispatch a subcommand to appropriate execution routine */ static void -ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, +ATExecCmd(List **wqueue, AlteredTableInfo *tab, AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass, AlterTableUtilityContext *context) { ObjectAddress address = InvalidObjectAddress; + Relation rel = tab->rel; switch (cmd->subtype) { @@ -5669,6 +5675,7 @@ ATGetQueueEntry(List **wqueue, Relation rel) */ tab = (AlteredTableInfo *) palloc0(sizeof(AlteredTableInfo)); tab->relid = relid; + tab->rel = NULL; /* set later */ tab->relkind = rel->rd_rel->relkind; tab->oldDesc = CreateTupleDescCopyConstr(RelationGetDescr(rel)); tab->newrelpersistence = RELPERSISTENCE_PERMANENT; -- 2.20.1 >From 6e893b5f2e7abae9e55ba0d1487d4621272d2cdf Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date
Re: Getting better results from valgrind leak tracking
Hi, (really need to fix my mobile phone mail program to keep the CC list...) On 2021-03-17 08:15:43 -0700, Andres Freund wrote: > On Wed, Mar 17, 2021, at 07:16, Tom Lane wrote: > > Andres Freund writes: > > > On 2021-03-16 20:50:17 -0700, Andres Freund wrote: > > Meanwhile, I'm still trying to understand why valgrind is whining > > about the rd_indexcxt identifier strings. AFAICS it shouldn't. > > I found a way around that late last night. Need to mark the context > itself as an allocation. But I made a mess on the way to that and need > to clean the patch up before sending it (and need to drop my > girlfriend off first). Unfortunately I didn't immediately find a way to do this while keeping the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool creation into the memory context implementations, "allocates" the context itself as part of that pool, and changes the reset implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do MEMPOOL_TRIM. That leaves the memory context itself valid (and thus tracking ->ident etc), but marks all the other memory as freed. This is just a first version, it probably needs more work, and definitely a few comments... After this, your changes, and the previously mentioned fixes, I get far fewer false positives. Also found a crash / memory leak in pgstat.c due to the new replication slot stats, but I'll start a separate thread. There are a few leak warnings around guc.c that look like they might be real, not false positives, and thus a bit concerning. Looks like several guc check hooks don't bother to free the old *extra before allocating a new one. I suspect we might get better results from valgrind, not just for leaks but also undefined value tracking, if we changed the way we represent pools to utilize VALGRIND_MEMPOOL_METAPOOL | VALGRIND_MEMPOOL_AUTO_FREE. E.g. aset.c would associate AllocBlock using VALGRIND_MEMPOOL_ALLOC and then mcxt.c would use VALGRIND_MALLOCLIKE_BLOCK for the individual chunk allocation. https://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.mempools I played with naming the allocations underlying aset.c using VALGRIND_CREATE_BLOCK(block, strlen(context->name), context->name). That does produce better undefined-value warnings, but it seems that e.g. the leak detector doen't have that information around. Nor does it seem to be usable for use-afte-free. At least the latter likely because I had to VALGRIND_DISCARD by that point... Greetings, Andres Freund >From c7e69e4bccfc4da97b0b999399732e3dc806b67e Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Wed, 17 Mar 2021 10:46:39 -0700 Subject: [PATCH] Make memory contexts themselves more visible to valgrind. 
--- src/include/utils/memdebug.h| 1 + src/backend/utils/mmgr/aset.c | 22 ++ src/backend/utils/mmgr/generation.c | 8 src/backend/utils/mmgr/mcxt.c | 5 - src/backend/utils/mmgr/slab.c | 8 5 files changed, 39 insertions(+), 5 deletions(-) diff --git a/src/include/utils/memdebug.h b/src/include/utils/memdebug.h index e88b4c6e8ef..5988bff8839 100644 --- a/src/include/utils/memdebug.h +++ b/src/include/utils/memdebug.h @@ -23,6 +23,7 @@ #define VALGRIND_CHECK_MEM_IS_DEFINED(addr, size) do {} while (0) #define VALGRIND_CREATE_MEMPOOL(context, redzones, zeroed) do {} while (0) #define VALGRIND_DESTROY_MEMPOOL(context) do {} while (0) +#define VALGRIND_MEMPOOL_TRIM(context, ptr, size) do {} while (0) #define VALGRIND_MAKE_MEM_DEFINED(addr, size)do {} while (0) #define VALGRIND_MAKE_MEM_NOACCESS(addr, size)do {} while (0) #define VALGRIND_MAKE_MEM_UNDEFINED(addr, size)do {} while (0) diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index ec6c130d0fb..9793ddf4a3d 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -433,6 +433,12 @@ AllocSetContextCreateInternal(MemoryContext parent, { /* Remove entry from freelist */ set = freelist->first_free; + + VALGRIND_CREATE_MEMPOOL(set, 0, false); + VALGRIND_MEMPOOL_ALLOC(set, set, sizeof(AllocSetContext)); + /* the contents are still valid, but valgrind can't know that */ + VALGRIND_MAKE_MEM_DEFINED(set, sizeof(AllocSetContext)); + freelist->first_free = (AllocSet) set->header.nextchild; freelist->num_free--; @@ -477,6 +483,8 @@ AllocSetContextCreateInternal(MemoryContext parent, name))); } + VALGRIND_CREATE_MEMPOOL(set, 0, false); + VALGRIND_MEMPOOL_ALLOC(set, set, sizeof(AllocSetContext)); /* * Avoid writing code that can fail between here and MemoryContextCreate; * we'd leak the header/initial block if we ereport in this stretch. @@ -611,6 +619,8 @@ AllocSetReset(MemoryContext context) Assert(context->mem_allocated == keepersize); + VALGRIND_MEMPOOL_TRIM(set, set, sizeof(AllocSetAlloc)); + /* Reset block size allocation sequence, too */ set->nextBlockSize = set->initBlockSize; } @@ -652,6 +662,16 @@ A
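As an aside for readers (not part of the attached patch): the two-level scheme mentioned at the end, using VALGRIND_MEMPOOL_METAPOOL plus VALGRIND_MEMPOOL_AUTO_FREE, would look roughly like the sketch below. It assumes a memcheck.h recent enough to provide VALGRIND_CREATE_MEMPOOL_EXT, and the function names are invented for illustration:

#include <stddef.h>
#include <valgrind/memcheck.h>

/* The context itself becomes a "metapool" ... */
static void
example_context_create(void *context)
{
	VALGRIND_CREATE_MEMPOOL_EXT(context, 0, 0,
								VALGRIND_MEMPOOL_METAPOOL | VALGRIND_MEMPOOL_AUTO_FREE);
}

/* ... aset.c would register each malloc'ed AllocBlock with that pool ... */
static void
example_block_alloc(void *context, void *block, size_t blksize)
{
	VALGRIND_MEMPOOL_ALLOC(context, block, blksize);
}

/* ... and mcxt.c would describe each chunk handed out to callers as a malloc-like block. */
static void
example_chunk_alloc(void *chunk, size_t size)
{
	VALGRIND_MALLOCLIKE_BLOCK(chunk, size, 0, 0);
}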
Re: [HACKERS] Custom compression methods
Hi, On 2021-03-17 13:31:14 -0400, Robert Haas wrote: > On Wed, Mar 17, 2021 at 7:41 AM Dilip Kumar wrote: > > 0002: > > - Wrapper over heap_form_tuple and used in ExecEvalRow() and > > ExecEvalFieldStoreForm() > > Instead of having heap_form_flattened_tuple(), how about > heap_flatten_values(tupleDesc, values, isnull) that is documented to > modify the values array? Then instead of replacing the > heap_form_tuple() calls with a call to heap_form_flattened_tuple(), > you just insert a call to heap_flatten_values() before the call to > heap_form_tuple(). I think that might be easier for people looking at > this code in the future to understand what's happening. OTOH heap_form_flattened_tuple() has the advantage that we can optimize it further (e.g. to do the conversion to flattened values in fill_val()) without changing the outside API. Greetings, Andres Freund
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
The v8 patch has the "broken constraint" problem. Also, it "fails to avoid" adding duplicate constraints: Check constraints: "c" CHECK (i IS NOT NULL AND i > 1 AND i < 2) "cc" CHECK (i IS NOT NULL AND i >= 1 AND i < 2) "p1_check" CHECK (true) "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2) > diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml > index 5c9f4af1d5..0cb846f408 100644 > --- a/doc/src/sgml/catalogs.sgml > +++ b/doc/src/sgml/catalogs.sgml > @@ -4485,6 +4485,16 @@ SCRAM-SHA-256$count>:&l > when using declarative partitioning. > > > + > + > + > + inhdetachpending bool > + > + > + Set to true for a partition that is in the process of being detached; > + false otherwise. > + > + Remove "Set to" ? And say true and false Probably you'll hate the suggestion, but maybe it should be "pendingdetach". We already have pg_settings.pending_restart. > + If CONCURRENTLY is specified, this process runs in > two > + transactions in order to avoid blocking other sessions that might be > accessing > + the partitioned table. During the first transaction, a > + SHARE UPDATE EXCLUSIVE lock is taken on both parent > table and > + partition, and its partition is marked detached; at that point, the > transaction > + is committed and all transactions using the partitioned table are > waited for. > + Once all those transactions are gone, the second stage acquires Instead of "gone", say "have completed" ? > +/* > + * MarkInheritDetached > + * > + * When a partition is detached from its parent concurrently, we don't > + * remove the pg_inherits row until a second transaction; as a preparatory > + * step, this function marks the entry as 'detached', so that other *pending detached > + * The strategy for concurrency is to first modify the partition catalog > + * rows to make it visible to everyone that the partition is detached, the inherits catalog? > + /* > + * In concurrent mode, the partition is locked with > share-update-exclusive > + * in the first transaction. This allows concurrent transactions to be > + * doing DML to the partition. > + /* > + * Check inheritance conditions and either delete the pg_inherits row > + * (in non-concurrent mode) or just set the inhisdetached flag. detachpending
Re: PoC/WIP: Extended statistics on expressions
On Wed, 17 Mar 2021 at 17:26, Tomas Vondra wrote: > > My concern is that the current behavior (where we prefer expression > stats over multi-column stats to some extent) works fine as long as the > parts are independent, but once there's dependency it's probably more > likely to produce underestimates. I think underestimates for grouping > estimates were a risk in the past, so let's not make that worse. > I'm not sure the current behaviour really is preferring expression stats over multi-column stats. In this example, where we're grouping by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of those multi-column stats actually match more than one column/expression. If anything, I'd go the other way and say that it was wrong to use the [(a+b),c] stats in the first case, where they were the only stats available, since those stats aren't really applicable to (c+d), which probably ought to be treated as independent. IOW, it might have been better to estimate the first case as ndistinct((a+b)) * ndistinct(c) * ndistinct(d) and the second case as ndistinct((a+b)) * ndistinct((c+d)) Regards, Dean
Re: WIP: BRIN multi-range indexes
On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra wrote: > > Hi, > > Here is an updated version of the patch series. > > It fixes the assert failure (just remove the multiplication from it) and > adds a simple regression test that exercises this. > > Based on the discussion so far, I've decided to keep just the new > signature of the consistent function. That's a bit simpler than having > to support both 3 and 4 parameters, and it would not deal with the NULL > changes anyway (mostly harmless code duplication, but still). I've also > realized the API documentation in SGML needs updating. > > At this point, I think 0001-0006 parts are mostly committable. I think so. I've marked it RFC for this six. > As for the remaining two parts, the one dealing with correlation may not > be strictly necessary, but not having it (or something like it) may > result in not picking the BRIN index in some cases. > > But maybe it's not a major problem. I tried the example from [1] but it > no longer triggers the issue for me - I'm not entirely sure why, but the > costing changed for some reason. It used to look like this: > ... > The index scan cost is about the same, but the heap scan is about half > the cost. The row estimates are a bit different, perhaps that's related. > The seqscan cost (169248) and duration (~500ms) is still about the same, > but so now we still pick the bitmap heap scan. With only 0001-0006, I get a parallel bitmap scan in both cases: QUERY PLAN --- Gather (cost=6542.42..52779.35 rows=10 width=4) (actual time=3.283..22.308 rows=10 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Bitmap Heap Scan on t0 (cost=5542.42..51778.35 rows=4 width=4) (actual time=3.434..17.257 rows=3 loops=3) Recheck Cond: (a = 1) Rows Removed by Index Recheck: 83165 Heap Blocks: lossy=421 -> Bitmap Index Scan on t0_a_idx (cost=0.00..5542.42 rows=381682 width=0) (actual time=2.996..2.996 rows=11040 loops=1) Index Cond: (a = 1) Planning Time: 0.088 ms Execution Time: 22.341 ms (11 rows) > Not sure we can rely on > this, though. Would be quite sad if we had new improved opclasses but > the planner often decided not to use them. Yeah, I'm not sure what to do here. It might be okay to leave it out now and study it further as a PG14 open item or PG15 improvement. > I had an idea of tweaking choose_bitmap_and to consider both the cost > and selectivity (similarly to how add_path considers statup/total cost), > and that did indeed resolve this particular case. This is what the 0008 > part does. > > But it may also have negative consequence, as demonstrated by the > reproducer2.sql script. So maybe the logic would need to be more > complicated. Or maybe there's no reliable solution, considering how > tricky/unreliable BRIN estimates are. 
Ok, so with 0008 in reproducer2, it chooses the more selective path, even though it has a higher total cost: 0001-0007: QUERY PLAN - Bitmap Heap Scan on t2 (cost=29.03..24032.28 rows=1 width=8) (actual time=0.498..1.755 rows=1 loops=1) Recheck Cond: (a = 1000) Rows Removed by Index Recheck: 7167 Heap Blocks: lossy=128 -> Bitmap Index Scan on idx_2 (cost=0.00..29.03 rows=7163 width=0) (actual time=0.278..0.278 rows=1280 loops=1) Index Cond: (a = 1000) Planning Time: 0.148 ms Execution Time: 1.774 ms (8 rows) DROP INDEX QUERY PLAN --- Bitmap Heap Scan on t2 (cost=656.00..1531.00 rows=1 width=8) (actual time=9.695..9.708 rows=1 loops=1) Recheck Cond: (a = 1000) Rows Removed by Index Recheck: 223 Heap Blocks: lossy=4 -> Bitmap Index Scan on idx_1 (cost=0.00..656.00 rows=224 width=0) (actual time=9.675..9.675 rows=40 loops=1) Index Cond: (a = 1000) Planning Time: 0.110 ms Execution Time: 9.727 ms (8 rows) and with 0008: QUERY PLAN --- Bitmap Heap Scan on t2 (cost=656.00..1531.00 rows=1 width=8) (actual time=8.540..8.577 rows=1 loops=1) Recheck Cond: (a = 1000) Rows Removed by Index Recheck: 223 Heap Blocks: lossy=4 -> Bitmap Index Scan on idx_1 (cost=0.00..656.00 rows=224 width=0) (actual time=8.507..8.508 rows=40 loops=1) Index Cond: (a = 1000) Planning Time: 0.175 ms Execution Time: 8.601 ms (8 rows) DROP INDEX
Re: PoC/WIP: Extended statistics on expressions
On 3/17/21 7:54 PM, Dean Rasheed wrote: > On Wed, 17 Mar 2021 at 17:26, Tomas Vondra > wrote: >> >> My concern is that the current behavior (where we prefer expression >> stats over multi-column stats to some extent) works fine as long as the >> parts are independent, but once there's dependency it's probably more >> likely to produce underestimates. I think underestimates for grouping >> estimates were a risk in the past, so let's not make that worse. >> > > I'm not sure the current behaviour really is preferring expression > stats over multi-column stats. In this example, where we're grouping > by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of > those multi-column stats actually match more than one > column/expression. If anything, I'd go the other way and say that it > was wrong to use the [(a+b),c] stats in the first case, where they > were the only stats available, since those stats aren't really > applicable to (c+d), which probably ought to be treated as > independent. IOW, it might have been better to estimate the first case > as > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > and the second case as > > ndistinct((a+b)) * ndistinct((c+d)) > OK. I might be confused, but isn't that what the algorithm currently does? Or am I just confused about what the first/second case refers to? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: BRIN multi-range indexes
On 3/17/21 7:59 PM, John Naylor wrote: > On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: >> >> Hi, >> >> Here is an updated version of the patch series. >> >> It fixes the assert failure (just remove the multiplication from it) and >> adds a simple regression test that exercises this. >> >> Based on the discussion so far, I've decided to keep just the new >> signature of the consistent function. That's a bit simpler than having >> to support both 3 and 4 parameters, and it would not deal with the NULL >> changes anyway (mostly harmless code duplication, but still). I've also >> realized the API documentation in SGML needs updating. >> >> At this point, I think 0001-0006 parts are mostly committable. > > I think so. I've marked it RFC for this six. > >> As for the remaining two parts, the one dealing with correlation may not >> be strictly necessary, but not having it (or something like it) may >> result in not picking the BRIN index in some cases. >> >> But maybe it's not a major problem. I tried the example from [1] but it >> no longer triggers the issue for me - I'm not entirely sure why, but the >> costing changed for some reason. It used to look like this: > >> ... > >> The index scan cost is about the same, but the heap scan is about half >> the cost. The row estimates are a bit different, perhaps that's related. >> The seqscan cost (169248) and duration (~500ms) is still about the same, >> but so now we still pick the bitmap heap scan. > > With only 0001-0006, I get a parallel bitmap scan in both cases: > > QUERY PLAN > --- > Gather (cost=6542.42..52779.35 rows=10 width=4) (actual > time=3.283..22.308 rows=10 loops=1) > Workers Planned: 2 > Workers Launched: 2 > -> Parallel Bitmap Heap Scan on t0 (cost=5542.42..51778.35 rows=4 > width=4) (actual time=3.434..17.257 rows=3 loops=3) > Recheck Cond: (a = 1) > Rows Removed by Index Recheck: 83165 > Heap Blocks: lossy=421 > -> Bitmap Index Scan on t0_a_idx (cost=0.00..5542.42 > rows=381682 width=0) (actual time=2.996..2.996 rows=11040 loops=1) > Index Cond: (a = 1) > Planning Time: 0.088 ms > Execution Time: 22.341 ms > (11 rows) > >> Not sure we can rely on >> this, though. Would be quite sad if we had new improved opclasses but >> the planner often decided not to use them. > > Yeah, I'm not sure what to do here. It might be okay to leave it out now > and study it further as a PG14 open item or PG15 improvement. > Yeah, that's definitely an option. >> I had an idea of tweaking choose_bitmap_and to consider both the cost >> and selectivity (similarly to how add_path considers statup/total cost), >> and that did indeed resolve this particular case. This is what the 0008 >> part does. >> >> But it may also have negative consequence, as demonstrated by the >> reproducer2.sql script. So maybe the logic would need to be more >> complicated. Or maybe there's no reliable solution, considering how >> tricky/unreliable BRIN estimates are. 
> > Ok, so with 0008 in reproducer2, it chooses the more selective path, > even though it has a higher total cost: > > 0001-0007: > > QUERY PLAN > - > Bitmap Heap Scan on t2 (cost=29.03..24032.28 rows=1 width=8) (actual > time=0.498..1.755 rows=1 loops=1) > Recheck Cond: (a = 1000) > Rows Removed by Index Recheck: 7167 > Heap Blocks: lossy=128 > -> Bitmap Index Scan on idx_2 (cost=0.00..29.03 rows=7163 width=0) > (actual time=0.278..0.278 rows=1280 loops=1) > Index Cond: (a = 1000) > Planning Time: 0.148 ms > Execution Time: 1.774 ms > (8 rows) > > DROP INDEX > QUERY PLAN > --- > Bitmap Heap Scan on t2 (cost=656.00..1531.00 rows=1 width=8) (actual > time=9.695..9.708 rows=1 loops=1) > Recheck Cond: (a = 1000) > Rows Removed by Index Recheck: 223 > Heap Blocks: lossy=4 > -> Bitmap Index Scan on idx_1 (cost=0.00..656.00 rows=224 width=0) > (actual time=9.675..9.675 rows=40 loops=1) > Index Cond: (a = 1000) > Planning Time: 0.110 ms > Execution Time: 9.727 ms > (8 rows) > > and with 0008: > > QUERY PLAN > --- > Bitmap Heap Scan on t2 (cost=656.00..1531.00 rows=1 width=8) (actual > time=8.540..8.577 rows=1 loops=1) > Recheck Cond: (a = 1000) > Rows Removed by I
Re: [HACKERS] Custom compression methods
On Wed, Mar 17, 2021 at 2:17 PM Andres Freund wrote: > OTOH heap_form_flattened_tuple() has the advantage that we can optimize > it further (e.g. to do the conversion to flattened values in fill_val()) > without changing the outside API. Well, in my view, that does change the outside API, because either the input values[] array is going to get scribbled on, or it's not. We should either decide we're not OK with it and just do the fill_val() thing now, or we should decide that we are and not worry about doing the fill_val() thing later. IMHO, anyway. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Mon, Mar 15, 2021 at 6:58 PM Andres Freund wrote: > - Adding all these indirect function calls via toast_compression[] just > for all of two builtin methods isn't fun either. Yeah, it feels like this has too many layers of indirection now. Like, toast_decompress_datum() first gets TOAST_COMPRESS_METHOD(attr). Then it calls CompressionIdToMethod to convert one constant (like TOAST_PGLZ_COMPRESSION_ID) to another constant with a slightly different name (like TOAST_PGLZ_COMPRESSION). Then it calls GetCompressionRoutines() to get hold of the function pointers. Then it does an indirect function call. That seemed like a pretty reasonable idea when we were trying to support arbitrary compression AMs without overly privileging the stuff that was built into core, but if we're just doing stuff that's built into core, then we could just switch (TOAST_COMPRESS_METHOD(attr)) and call the correct function. In fact, we could even move the stuff from toast_compression.c into detoast.c, which would allow the compiler to optimize better (e.g. by inlining, if it wants). The same applies to toast_decompress_datum_slice(). There's a similar issue in toast_get_compression_method() and the only caller, pg_column_compression(). Here the multiple mapping layers and the indirect function call are split across those two functions rather than all in the same one, but here again one could presumably find a place to just switch on TOAST_COMPRESS_METHOD(attr) or VARATT_EXTERNAL_GET_COMPRESSION(attr) and return "pglz" or "lz4" directly. In toast_compress_datum(), I think we could have a switch that invokes the appropriate compressor based on cmethod and sets a variable to the value to be passed as the final argument of TOAST_COMPRESS_SET_SIZE_AND_METHOD(). Likewise, I suppose CompressionNameToMethod could at least be simplified to use constant strings rather than stuff like toast_compression[TOAST_PGLZ_COMPRESSION_ID].cmname. > - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be > comparable to HIDE_TABLEAM? Andres, what do you mean by this exactly? It's exactly the same issue: without this, if you change the default compression method, every test that uses \d+ breaks. If you want to be able to run the whole test suite with either compression method and get the same results, you need this. Now, maybe you don't, because perhaps that doesn't seem so important with compression methods as with table AMs. I think that's a defensible position. But, it is at the underlying level, the same thing. -- Robert Haas EDB: http://www.enterprisedb.com
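To make the suggestion concrete, a direct-dispatch toast_decompress_datum() could look roughly like the sketch below. This is only an illustration of the switch idea, not code from the patch; the two per-method helpers (pglz_decompress_datum, lz4_decompress_datum) are assumed names for whatever the built-in decompression routines end up being called.

static struct varlena *
toast_decompress_datum(struct varlena *attr)
{
    /* dispatch directly on the stored method id, no lookup table needed */
    switch (TOAST_COMPRESS_METHOD(attr))
    {
        case TOAST_PGLZ_COMPRESSION_ID:
            return pglz_decompress_datum(attr);
        case TOAST_LZ4_COMPRESSION_ID:
            return lz4_decompress_datum(attr);
        default:
            elog(ERROR, "invalid compression method id %d",
                 TOAST_COMPRESS_METHOD(attr));
            return NULL;        /* keep compiler quiet */
    }
}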
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud wrote: > I originally suggested to make it clearer by having an enum GUC rather than a > boolean, say compute_queryid = [ none | core | external ], and if set to > external then a hook would be explicitely called. Right now, "none" and > "external" are binded with compute_queryid = off, and depends on whether an > extension is computing a queryid during post_parse_analyse_hook. I would just make it a Boolean and have a hook. The Boolean controls whether it gets computed at all, and the hook lets an external module override the way it gets computed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: PoC/WIP: Extended statistics on expressions
On Wed, 17 Mar 2021 at 19:07, Tomas Vondra wrote: > > On 3/17/21 7:54 PM, Dean Rasheed wrote: > > > > it might have been better to estimate the first case as > > > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > > > and the second case as > > > > ndistinct((a+b)) * ndistinct((c+d)) > > OK. I might be confused, but isn't that what the algorithm currently > does? Or am I just confused about what the first/second case refers to? > No, it currently estimates the first case as ndistinct((a+b),c) * ndistinct(d). Having said that, maybe that's OK after all. It at least makes an effort to account for any correlation between (a+b) and (c+d), using the known correlation between (a+b) and c. For reference, here is the test case I was using (which isn't really very good for catching dependence between columns): DROP TABLE IF EXISTS foo; CREATE TABLE foo (a int, b int, c int, d int); INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,10) x; SELECT COUNT(DISTINCT a) FROM foo; -- 10 SELECT COUNT(DISTINCT b) FROM foo; -- 11 SELECT COUNT(DISTINCT c) FROM foo; -- 12 SELECT COUNT(DISTINCT d) FROM foo; -- 13 SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20 SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24 SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228 SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478 -- First case: stats on [(a+b),c] CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 2964, Actual = 478 -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13 -- Second case: stats on (c+d) as well CREATE STATISTICS s2 ON (c+d) FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 480, Actual = 478 -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24 I think that's probably pretty reasonable behaviour, given incomplete stats (the estimate with no extended stats is capped at 1). Regards, Dean
Re: VACUUM (DISABLE_PAGE_SKIPPING on)
On Fri, 12 Mar 2021 at 22:16, Tomas Vondra wrote: > > On 1/28/21 2:33 PM, Simon Riggs wrote: > > On Thu, 28 Jan 2021 at 12:53, Masahiko Sawada wrote: > > > >> This entry has been in "Waiting on Author" status and the patch has not > >> been updated since Nov 30. Are you still planning to work on this? > > > > Yes, new patch version tomorrow. Thanks for the nudge and the review. > > > > So, is it tomorrow already? ;-) Been a long day. ;-) -- Simon Riggs http://www.EnterpriseDB.com/
Re: non-HOT update not looking at FSM for large tuple update
On Fri, Mar 12, 2021 at 8:45 AM Matthias van de Meent < boekewurm+postg...@gmail.com> wrote: > > If this case isn't added, the lower else branch will fail to find > fitting pages for len > maxPaddedFsmRequest tuples; potentially > extending the relation when there is actually still enough space > available. Okay, with that it looks better to go back to using Max(). Also in v4: - With the separate constant you suggested, I split up the comment into two parts. - I've added a regression test to insert.sql similar to your earlier test, but we cannot use vacuum, since in parallel tests there could still be tuples visible to other transactions. It's still possible to test almost-all-free by inserting a small tuple. - Draft commit message -- John Naylor EDB: http://www.enterprisedb.com v4-0001-Fix-bug-in-heap-space-management-that-was-overly-.patch Description: Binary data
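For readers following along, the shape of the fix under discussion is roughly the following (a sketch only; maxPaddedFsmRequest is the separate constant mentioned above, and its exact definition and placement are the patch's, not shown here):

    /*
     * For tuples too large to ever fit on a page together with the usual
     * fillfactor slack, don't ask the FSM for len + saveFreeSpace (no page
     * will ever advertise that much); ask for the most a page can offer.
     */
    if (len + saveFreeSpace > MaxHeapTupleSize)
        targetFreeSpace = Max(len, maxPaddedFsmRequest);
    else
        targetFreeSpace = len + saveFreeSpace;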
Re: WIP: BRIN multi-range indexes
On Wed, Mar 17, 2021 at 3:16 PM Tomas Vondra wrote: > Ummm, no attachment ;-) Oops, here it is. -- John Naylor EDB: http://www.enterprisedb.com jcn-costing-test.sql Description: Binary data
Re: PoC/WIP: Extended statistics on expressions
On Wed, 17 Mar 2021 at 20:48, Dean Rasheed wrote: > > For reference, here is the test case I was using (which isn't really very > good for > catching dependence between columns): > And here's a test case with much more dependence between the columns: DROP TABLE IF EXISTS foo; CREATE TABLE foo (a int, b int, c int, d int); INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,10) x; SELECT COUNT(DISTINCT a) FROM foo; -- 2 SELECT COUNT(DISTINCT b) FROM foo; -- 5 SELECT COUNT(DISTINCT c) FROM foo; -- 10 SELECT COUNT(DISTINCT d) FROM foo; -- 15 SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6 SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20 SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10 SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30 -- First case: stats on [(a+b),c] CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 150, Actual = 30 -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15, -- which is much better than ndistinct((a+b)) * ndistinct(c) * ndistinct(d) = 6*10*15 = 900 -- Estimate with no stats = 1500 -- Second case: stats on (c+d) as well CREATE STATISTICS s2 ON (c+d) FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 120, Actual = 30 -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20 Again, I'd say the current behaviour is pretty good. Regards, Dean
Re: PoC/WIP: Extended statistics on expressions
On 3/17/21 9:58 PM, Dean Rasheed wrote: > On Wed, 17 Mar 2021 at 20:48, Dean Rasheed wrote: >> >> For reference, here is the test case I was using (which isn't really very >> good for >> catching dependence between columns): >> > > And here's a test case with much more dependence between the columns: > > DROP TABLE IF EXISTS foo; > CREATE TABLE foo (a int, b int, c int, d int); > INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,10) x; > SELECT COUNT(DISTINCT a) FROM foo; -- 2 > SELECT COUNT(DISTINCT b) FROM foo; -- 5 > SELECT COUNT(DISTINCT c) FROM foo; -- 10 > SELECT COUNT(DISTINCT d) FROM foo; -- 15 > SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6 > SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20 > SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10 > SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30 > > -- First case: stats on [(a+b),c] > CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; > ANALYSE foo; > EXPLAIN ANALYSE > SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); > -- Estimate = 150, Actual = 30 > -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15, > -- which is much better than ndistinct((a+b)) * ndistinct(c) * > ndistinct(d) = 6*10*15 = 900 > -- Estimate with no stats = 1500 > > -- Second case: stats on (c+d) as well > CREATE STATISTICS s2 ON (c+d) FROM foo; > ANALYSE foo; > EXPLAIN ANALYSE > SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); > -- Estimate = 120, Actual = 30 > -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20 > > Again, I'd say the current behaviour is pretty good. > Thanks! I agree applying at least the [(a+b),c] stats is probably the right approach, as it means we're considering at least the available information about dependence between the columns. I think to improve this, we'll need to teach the code to use overlapping statistics, a bit like conditional probability. In this case we might do something like this: ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c)) Which in this case would be either, for the "less correlated" case 228 * 24 / 12 = 456 (actual = 478, current estimate = 480) or, for the "more correlated" case 10 * 20 / 10 = 20 (actual = 30, current estimate = 120) But that's clearly a matter for a future patch, and I'm sure there are cases where this will produce worse estimates. Anyway, I plan to go over the patches one more time, and start pushing them sometime early next week. I don't want to leave it until the very last moment in the CF. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Greetings, * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: > Right, I was just going to point out the FPIs are not necessary - what > matters is the presence of long streaks of WAL records touching the same > set of blocks. But people with workloads where this is common likely > don't need the WAL prefetching at all - the replica can keep up just > fine, because it doesn't need to do much I/O anyway (and if it can't > then prefetching won't help much anyway). So just don't enable the > prefetching, and there'll be no overhead. Isn't this exactly the common case though..? Checkpoints happening every 5 minutes, the replay of the FPI happens first and then the record is updated and everything's in SB for the later changes? You mentioned elsewhere that this would improve 80% of cases but that doesn't seem to be backed up by anything and certainly doesn't seem likely to be the case if we're talking about across all PG deployments. I also disagree that asking the kernel to go do random I/O for us, even as a prefetch, is entirely free simply because we won't actually need those pages. At the least, it potentially pushes out pages that we might need shortly from the filesystem cache, no? > If it was up to me, I'd just get the patch committed as is. Delaying the > feature because of concerns that it might have some negative effect in > some cases, when that can be simply mitigated by disabling the feature, > is not really beneficial for our users. I don't know that we actually know how many cases it might have a negative effect on or what the actual amount of such negative case there might be- that's really why we should probably try to actually benchmark it and get real numbers behind it, particularly when the chances of running into such a negative effect with the default configuration (that is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more likely to occur in the field than the cases where FPWs are disabled and someone's running on ZFS. Perhaps more to the point, it'd be nice to see how this change actually improves the caes where PG is running with more-or-less the defaults on the more commonly deployed filesystems. If it doesn't then maybe it shouldn't be the default..? Surely the folks running on ZFS and running with FPWs disabled would be able to manage to enable it if they wished to and we could avoid entirely the question of if this has a negative impact on the more common cases. Guess I'm just not a fan of pushing out a change that will impact everyone by default, in a possibly negative way (or positive, though that doesn't seem terribly likely, but who knows), without actually measuring what that impact will look like in those more common cases. Showing that it's a great win when you're on ZFS or running with FPWs disabled is good and the expected best case, but we should be considering the worst case too when it comes to performance improvements. Anyhow, ultimately I don't know that there's much more to discuss on this thread with regard to this particular topic, at least. As I said before, if everyone else is on board and not worried about it then so be it; I feel that at least the concern that I raised has been heard. Thanks, Stephen signature.asc Description: PGP signature
Re: [HACKERS] Custom compression methods
Hi, On 2021-03-17 16:01:58 -0400, Robert Haas wrote: > > - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be > > comparable to HIDE_TABLEAM? > > Andres, what do you mean by this exactly? It's exactly the same issue: > without this, if you change the default compression method, every test > that uses \d+ breaks. If you want to be able to run the whole test > suite with either compression method and get the same results, you > need this. Now, maybe you don't, because perhaps that doesn't seem so > important with compression methods as with table AMs. I think that latter part is why I wasn't sure such an option is warranted. Given it's a builtin feature, I didn't really foresee a need to be able to run all the tests with a different compression method. And it looked like it could just have been copied from the tableam logic, without a clear need. But if it's useful, then ... Greetings, Andres Freund
Re: Getting better results from valgrind leak tracking
Andres Freund writes: >> I found a way around that late last night. Need to mark the context >> itself as an allocation. But I made a mess on the way to that and need >> to clean the patch up before sending it (and need to drop my >> girlfriend off first). > Unfortunately I didn't immediately find a way to do this while keeping > the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool > creation into the memory context implementations, "allocates" the > context itself as part of that pool, and changes the reset > implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do > MEMPOOL_TRIM. That leaves the memory context itself valid (and thus > tracking ->ident etc), but marks all the other memory as freed. Huh, interesting. I wonder why that makes the ident problem go away? I'd supposed that valgrind would see the context headers as ordinary memory belonging to the global "malloc" pool, so that any pointers inside them ought to be considered valid. Anyway, I don't have a problem with rearranging the responsibility like this. It gives the individual allocators more freedom to do odd stuff, at the cost of very minor duplication of valgrind calls. I agree we need more comments -- would you like me to have a go at writing them? One thing I was stewing over last night is that a MemoryContextReset will mess up any context identifier assigned with MemoryContextCopyAndSetIdentifier. I'd left that as a problem to fix later, because we don't currently have a need to reset contexts that use copied identifiers. But that assumption obviously will bite us someday, so maybe now is a good time to think about it. The very simplest fix would be to allocate non-constant idents with malloc; which'd require adding a flag to track whether context->ident needs to be free()d. We have room for another bool near the top of struct MemoryContextData (and at some point we could turn those bool fields into a flags word). The only real cost here is one more free() while destroying a labeled context, which is probably negligible. Other ideas are possible but they seem to require getting the individual mcxt methods involved, and I doubt it's worth the complexity. > There are a few leak warnings around guc.c that look like they might be > real, not false positives, and thus a bit concerning. Looks like several > guc check hooks don't bother to free the old *extra before allocating a > new one. I'll take a look, but I'm pretty certain that guc.c, not the hooks, is responsible for freeing those. Might be another case of valgrind not understanding what's happening. > I suspect we might get better results from valgrind, not just for leaks > but also undefined value tracking, if we changed the way we represent > pools to utilize VALGRIND_MEMPOOL_METAPOOL | > VALGRIND_MEMPOOL_AUTO_FREE. Don't really see why that'd help? I mean, it could conceivably help catch bugs in the allocators themselves, but I don't follow the argument that it'd improve anything else. Defined is defined, as far as I can tell from the valgrind manual. regards, tom lane
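For readers not following the patch itself, the rearrangement being described boils down to something like the following inside aset.c (a sketch only, under the assumption that the context header lives in the first malloc'd block; the real patch's details may differ):

    /*
     * At AllocSetContextCreate() time: the context is its own valgrind pool,
     * and the header is registered as a chunk of that pool, so valgrind keeps
     * treating set->header.ident etc. as referenced memory.
     */
    VALGRIND_CREATE_MEMPOOL(set, 0, false);
    VALGRIND_MEMPOOL_ALLOC(set, set, MAXALIGN(sizeof(AllocSetContext)));

    /*
     * At AllocSetReset() time: instead of MEMPOOL_DESTROY + MEMPOOL_CREATE,
     * keep the pool and the header chunk, and mark everything else as freed.
     */
    VALGRIND_MEMPOOL_TRIM(set, set, MAXALIGN(sizeof(AllocSetContext)));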
Re: pl/pgsql feature request: shorthand for argument and local variable references
why are you using yet another special syntax for this ? would it not be better to do something like this: CREATE FUNCTION a_reall_long_and_winding_function_name(i int, out o int) LANGUAGE plpgsql AS $plpgsql$ DECLARE args function_name_alias BEGIN args.o = 2 * args.i END; $plpgsql$; or at least do something based on block labels (see end of https://www.postgresql.org/docs/13/plpgsql-structure.html ) CREATE FUNCTION somefunc() RETURNS integer AS $$ << outerblock >> DECLARE ... maybe extend this to CREATE FUNCTION somefunc() RETURNS integer AS $$ << functionlabel:args >> << outerblock >> DECLARE ... or replace 'functionlabel' with something less ugly :) maybe <<< args >>> Cheers Hannu On Wed, Mar 17, 2021 at 5:05 PM Pavel Stehule wrote: > > > > st 17. 3. 2021 v 9:20 odesílatel Michael Paquier napsal: >> >> On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote: >> > I also think that there should be a single usable top label, otherwise it >> > will >> > lead to confusing code and it can be a source of bug. >> >> Okay, that's fine by me. Could it be possible to come up with an >> approach that does not hijack the namespace list contruction and the >> lookup logic as much as it does now? I get it that the patch is done >> this way because of the ordering of operations done for the initial ns >> list creation and the grammar parsing that adds the routine label on >> top of the root one, but I'd like to believe that there are more solid >> ways to achieve that, like for example something that decouples the >> root label and its alias, or something that associates an alias >> directly with its PLpgSQL_nsitem entry? > > > I am checking it now, and I don't see any easy solution. The namespace is a > one direction tree - only backward iteration from leafs to roots is > supported. At the moment, when I try to replace the name of root ns_item, > this root ns_item is referenced by the function's arguments (NS_TYPE_VAR > type). So anytime I need some helper ns_item node, that can be a persistent > target for these nodes. In this case is a memory overhead of just one ns_item. > > orig_label <- argument1 <- argument2 > > The patch has to save the original orig_label because it can be referenced > from argument1 or by "FOUND" variable. New graphs looks like > > new_label <- invalidated orig_label <- argument1 <- ... > > This tree has a different direction than is usual, and then replacing the > root node is not simple. > > Second solution (significantly more simple) is an additional pointer in > ns_item structure. In almost all cases this pointer will not be used. Because > ns_item uses a flexible array, then union cannot be used. I implemented this > in a patch marked as "alias-implementation". > > What do you think about it? > > Pavel > > > > > > >> >> -- >> Michael
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 04:04:44PM -0400, Robert Haas wrote: > On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud wrote: > > I originally suggested to make it clearer by having an enum GUC rather than > > a > > boolean, say compute_queryid = [ none | core | external ], and if set to > > external then a hook would be explicitely called. Right now, "none" and > > "external" are binded with compute_queryid = off, and depends on whether an > > extension is computing a queryid during post_parse_analyse_hook. > > I would just make it a Boolean and have a hook. The Boolean controls > whether it gets computed at all, and the hook lets an external module > override the way it gets computed. OK, is that what everyone wants? I think that is what the patch already does. I think having multiple queryids used in a single cluster is much too confusing to support. You would have to label and control which queryid is displayed by pg_stat_activity and log_line_prefix, and that seems too confusing and not useful. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Wed, Mar 17, 2021 at 05:04:48PM +0100, Pavel Stehule wrote: > This tree has a different direction than is usual, and then replacing the > root node is not simple. Yeah, it is not like we should redesign this whole part just for the feature discussed here, and that may impact performance as the current list handling is cheap now. > Second solution (significantly more simple) is an additional pointer in > ns_item structure. In almost all cases this pointer will not be used. > Because ns_item uses a flexible array, then union cannot be used. I > implemented this in a patch marked as "alias-implementation". > > What do you think about it? I am not sure that it is necessary nor good to replace entirely the root label. So associating a new field directly into it sounds like a promising approach? -- Michael signature.asc Description: PGP signature
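For concreteness, the "additional pointer" variant would presumably look something like this in plpgsql.h (the alias field name is invented here for illustration; the flexible array at the end is why a union can't be used):

typedef struct PLpgSQL_nsitem
{
    PLpgSQL_nsitem_type itemtype;
    int         itemno;
    struct PLpgSQL_nsitem *prev;
    /*
     * hypothetical: alternative label node for the routine's root entry,
     * NULL for all other namespace items
     */
    struct PLpgSQL_nsitem *alias;
    char        name[FLEXIBLE_ARRAY_MEMBER];
} PLpgSQL_nsitem;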
Re: Calendar support in localization
On Thu, Mar 18, 2021 at 3:48 AM Tom Lane wrote: > Robert Haas writes: > > It's not very obvious how to scale this kind of approach to a wide > > variety of calendar types, and as Thomas says, it would much cooler to > > be able to handle all of the ones that ICU knows how to support rather > > than just one. But, the problem I see with using timestamptz is that > > it's not so obvious how to get a different output format ... unless, I > > guess, we can cram it into DateStyle. And it's also much less obvious > > how you get the other functions and operators to do what you want, if > > it's different. > > Yeah, I'm afraid that it probably is different. The most obvious > example is in operations involving type interval: > select now() + '1 month'::interval; > That should almost certainly give a different answer when using a > different calendar --- indeed the units of interest might not even > be the same. (Do all human calendars use the concept of months?) Right, so if this is done by trying to extend Daniel Verite's icu_ext extension (link given earlier) and Robert's idea of a fast-castable type, I suppose you might want now()::icu_date + '1 month'::internal to advance you by one Ethiopic month if you have done SET icu_ext.ICU_LC_TIME = 'am_ET@calendar=traditional'. Or if using my first idea of just sticking with the core types, perhaps you'd have to replace stuff via search path... I admit that sounds rather error prone and fragile (I was thinking mainly of different functions, not operators). Either way, I suppose there'd also be more explicit functions for various operations including ones that take an extra argument if you want to use an explicit locale instead of relying on the ICU_LC_TIME setting. I dunno. As for whether all calendars have months, it looks like ICU's model has just the familiar looking standardised fields; whether some of them make no sense in some calendars, I don't know, but it has stuff like x.get(field, &error), x.set(field, &error), x.add(field, amount, &error) and if it fails for some field on your particular calendar, or for some value (you can't set a Gregorian date's month to 13 (apparently we call this month "undecember", hah), but you can for a Hebrew or Ethiopic one) I suppose we'd just report the error? > I don't feel like DateStyle is chartered to affect the behavior > of datetime operators; it's understood as tweaking the I/O behavior > only. There might be more of a case for letting LC_TIME choose > this behavior, but I bet the relevant standards only contemplate About LC_TIME... I suppose in one possible future we eventually use ICU for more core stuff, and someone proposes to merge hypothetical icu_date etc types into the core date etc types, and then LC_TIME controls that. But then you might have a version of the problem that Peter E ran into in attempts so far to use ICU collations as the default: if you put ICU's funky extensible locale names into the LC_XXX environment variables, then your libc will see it too, and might get upset, since PostgreSQL uses the en. I suspect that ICU will understand typical libc locale names, but common libcs won't understand ICU's highly customisable syntax, but I haven't looked into it. If that's generally true, then perhaps the solution to both problems is a kind of partial separation: regular LC_XXX, and then also ICU_LC_XXX which defaults to the same value but can be changed to access more advanced stuff, and is used only for interacting with ICU. > Gregorian calendars. 
Also, the SQL spec says in so many words > that the SQL-defined datetime types follow the Gregorian calendar. :-(
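As a concrete illustration of the ICU model referred to above, here is what "+ 1 month" in the traditional Ethiopic calendar looks like with ICU's plain C API (purely illustrative, not a proposal for any PostgreSQL-level interface; now_millis stands for some UDate value obtained elsewhere):

#include <unicode/ucal.h>

UErrorCode  status = U_ZERO_ERROR;
/* default time zone, Amharic locale with the traditional (Ethiopic) calendar */
UCalendar  *cal = ucal_open(NULL, -1, "am_ET@calendar=traditional",
                            UCAL_DEFAULT, &status);

ucal_setMillis(cal, now_millis, &status);   /* now_millis: assumed input */
ucal_add(cal, UCAL_MONTH, 1, &status);      /* advance by one calendar month */
if (U_FAILURE(status))
    ;   /* report the error, e.g. field/value not valid for this calendar */
ucal_close(cal);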
Re: WIP: WAL prefetch (another approach)
Hi, On 3/17/21 10:43 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: >> Right, I was just going to point out the FPIs are not necessary - what >> matters is the presence of long streaks of WAL records touching the same >> set of blocks. But people with workloads where this is common likely >> don't need the WAL prefetching at all - the replica can keep up just >> fine, because it doesn't need to do much I/O anyway (and if it can't >> then prefetching won't help much anyway). So just don't enable the >> prefetching, and there'll be no overhead. > > Isn't this exactly the common case though..? Checkpoints happening > every 5 minutes, the replay of the FPI happens first and then the record > is updated and everything's in SB for the later changes? Well, as I said before, the FPIs are not very significant - you'll have mostly the same issue with any repeated changes to the same block. It does not matter much if you do FPI for block 1 WAL record for block 2 WAL record for block 1 WAL record for block 2 WAL record for block 1 or just WAL record for block 1 WAL record for block 2 WAL record for block 1 WAL record for block 2 WAL record for block 1 In both cases some of the prefetches are probably unnecessary. But the frequency of checkpoints does not really matter, the important bit is repeated changes to the same block(s). If you have active set much larger than RAM, this is quite unlikely. And we know from the pgbench tests that prefetching has a huge positive effect in this case. On smaller active sets, with frequent updates to the same block, we may issue unnecessary prefetches - that's true. But (a) you have not shown any numbers suggesting this is actually an issue, and (b) those cases don't really need prefetching because all the data is already either in shared buffers or in page cache. So if it happens to be an issue, the user can simply disable it. So how exactly would a problematic workload look like? > You mentioned elsewhere that this would improve 80% of cases but that > doesn't seem to be backed up by anything and certainly doesn't seem > likely to be the case if we're talking about across all PG > deployments. Obviously, the 80% was just a figure of speech, illustrating my belief that the proposed patch is beneficial for most users who currently have issues with replication lag. That is based on my experience with support customers who have such issues - it's almost invariably an OLTP workload with large active set, and we know (from the benchmarks) that in these cases it helps. Users who don't have issues with replication lag can disable (or not enable) the prefetching, and won't get any negative effects. Perhaps there are users with weird workloads that have replication lag issues but this patch won't help them - bummer, we can't solve everything in one go. Also, no one actually demonstrated such workload in this thread so far. But as you're suggesting we don't have data to support the claim that this actually helps many users (with no risk to others), I'd point out you have not actually provided any numbers showing that it actually is an issue in practice. > I also disagree that asking the kernel to go do random I/O for us, > even as a prefetch, is entirely free simply because we won't > actually need those pages. At the least, it potentially pushes out > pages that we might need shortly from the filesystem cache, no? Where exactly did I say it's free? 
I said that workloads where this happens a lot most likely don't need the prefetching at all, so it can be simply disabled, eliminating all negative effects. Moreover, looking at a limited number of recently prefetched blocks won't eliminate this problem anyway - imagine a random OLTP on large data set that however fits into RAM. After a while no read I/O needs to be done, but you'd need pretty much infinite list of prefetched blocks to eliminate that, and with smaller lists you'll still do 99% of the prefetches. Just disabling prefetching on such instances seems quite reasonable. >> If it was up to me, I'd just get the patch committed as is. Delaying the >> feature because of concerns that it might have some negative effect in >> some cases, when that can be simply mitigated by disabling the feature, >> is not really beneficial for our users. > > I don't know that we actually know how many cases it might have a > negative effect on or what the actual amount of such negative case there > might be- that's really why we should probably try to actually benchmark > it and get real numbers behind it, particularly when the chances of > running into such a negative effect with the default configuration (that > is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more > likely to occur in the field than the cases where FPWs are disabled and > someone's running on ZFS. > > Perhaps more to the point, it'd be nice to see how this
replication slot stats memory bug
Hi, in the course of https://postgr.es/m/3471359.1615937770%40sss.pgh.pa.us I saw a leak in pgstat_read_statsfiles(), more precisely: /* Allocate the space for replication slot statistics */ replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats)); the issue is that the current memory context is not set by pgstat_read_statsfiles(). In some cases CurrentMemoryContext is going to be a long-lived context, accumulating those allocations over time. In other contexts it will be a too short lived context, e.g. an ExprContext from the pg_stat_* invocation in the query. A reproducer for the latter: postgres[2252294][1]=# SELECT pg_create_logical_replication_slot('test', 'test_decoding'); ┌┐ │ pg_create_logical_replication_slot │ ├┤ │ (test,0/456C1878) │ └┘ (1 row) postgres[2252294][1]=# BEGIN ; BEGIN postgres[2252294][1]*=# SELECT * FROM pg_stat_replication_slots ; ┌───┬┬─┬─┬─┬──┬──┬─┐ │ slot_name │ spill_txns │ spill_count │ spill_bytes │ stream_txns │ stream_count │ stream_bytes │ stats_reset │ ├───┼┼─┼─┼─┼──┼──┼─┤ │ test │ 0 │ 0 │ 0 │ 0 │ 0 │0 │ (null) │ └───┴┴─┴─┴─┴──┴──┴─┘ (1 row) postgres[2252294][1]*=# SELECT * FROM pg_stat_replication_slots ; ┌─> │ > ├─> │ \x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F> └─> (1 row) I'll push the minimal fix of forcing the allocation to happen in pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot(). But it seems like we just shouldn't allocate it dynamically at all? max_replication_slots doesn't change during postmaster lifetime, so it seems like it should just be allocated once? Greetings, Andres Freund
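The minimal fix mentioned above amounts to something like this (a sketch; the committed change may differ in detail):

    /*
     * in pgstat_read_statsfiles(): allocate in the stats snapshot context,
     * not in whatever CurrentMemoryContext happens to be
     */
    replSlotStats = MemoryContextAllocZero(pgStatLocalContext,
                                           max_replication_slots *
                                           sizeof(PgStat_ReplSlotStats));

    /* in pgstat_clear_snapshot(): drop it with the rest of the snapshot state */
    replSlotStats = NULL;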
Re: replication slot stats memory bug
Hi, On 2021-03-17 16:04:47 -0700, Andres Freund wrote: > I'll push the minimal fix of forcing the allocation to happen in > pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot(). Done: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5f79580ad69f6e696365bdc63bc265f45bd77211 Greetings, Andres Freund
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 06:32:16PM -0400, Bruce Momjian wrote: > On Wed, Mar 17, 2021 at 04:04:44PM -0400, Robert Haas wrote: > > On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud wrote: > > > I originally suggested to make it clearer by having an enum GUC rather > > > than a > > > boolean, say compute_queryid = [ none | core | external ], and if set to > > > external then a hook would be explicitely called. Right now, "none" and > > > "external" are binded with compute_queryid = off, and depends on whether > > > an > > > extension is computing a queryid during post_parse_analyse_hook. > > > > I would just make it a Boolean and have a hook. The Boolean controls > > whether it gets computed at all, and the hook lets an external module > > override the way it gets computed. > > OK, is that what everyone wants? I think that is what the patch already > does. Not exactly. Right now a custom queryid can be computed even if compute_queryid is off, if some extension does that in post_parse_analyze_hook. I'm assuming that what Robert was thinking was more like:

if (compute_queryid)
{
    if (queryid_hook)
        queryId = queryid_hook(...);
    else
        queryId = JumbleQuery(...);
}
else
    queryId = 0;

And that should be done *after* post_parse_analyse_hook so that it's clear that this hook is no longer the place to compute queryid. Is that what should be done? > I think having multiple queryids used in a single cluster is much too > confusing to support. You would have to label and control which queryid > is displayed by pg_stat_activity and log_line_prefix, and that seems too > confusing and not useful. I agree.
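Under that scheme, an extension that wants its own queryid would simply install the hook rather than piggy-back on post_parse_analyse_hook. A hypothetical sketch follows; the hook variable name and signature are illustrative only, not settled API, and the body is a toy fingerprint rather than a serious jumbling scheme:

#include "postgres.h"
#include "common/hashfn.h"
#include "parser/parse_node.h"

PG_MODULE_MAGIC;

/* hypothetical core-provided hook variable */
extern uint64 (*queryid_hook) (ParseState *pstate, Query *query);

static uint64
my_compute_queryid(ParseState *pstate, Query *query)
{
    /* toy example: hash the raw query text instead of jumbling the tree */
    return hash_bytes_extended((const unsigned char *) pstate->p_sourcetext,
                               (int) strlen(pstate->p_sourcetext), 0);
}

void
_PG_init(void)
{
    queryid_hook = my_compute_queryid;
}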
Re: replication slot stats memory bug
Andres Freund writes: > I saw a leak in pgstat_read_statsfiles(), more precisely: > /* Allocate the space for replication slot statistics */ > replSlotStats = palloc0(max_replication_slots * > sizeof(PgStat_ReplSlotStats)); Yeah, I just found that myself. I think your fix is good. > But it seems like we just shouldn't allocate it dynamically at all? > max_replication_slots doesn't change during postmaster lifetime, so it > seems like it should just be allocated once? Meh. I don't see a need to wire in such an assumption here. regards, tom lane
Re: Getting better results from valgrind leak tracking
Hi, On 2021-03-17 18:07:36 -0400, Tom Lane wrote: > Andres Freund writes: > >> I found a way around that late last night. Need to mark the context > >> itself as an allocation. But I made a mess on the way to that and need > >> to clean the patch up before sending it (and need to drop my > >> girlfriend off first). > > > Unfortunately I didn't immediately find a way to do this while keeping > > the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool > > creation into the memory context implementations, "allocates" the > > context itself as part of that pool, and changes the reset > > implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do > > MEMPOOL_TRIM. That leaves the memory context itself valid (and thus > > tracking ->ident etc), but marks all the other memory as freed. > > Huh, interesting. I wonder why that makes the ident problem go away? > I'd supposed that valgrind would see the context headers as ordinary > memory belonging to the global "malloc" pool, so that any pointers > inside them ought to be considered valid. I'm didn't quite understand either at the time of writing the change. It just seemed the only explanation for some the behaviour I was seeing, so I tried it. Just to be initially be rebuffed due to errors when accessing the recycled sets... I spent a bit of time looking at valgrind, and it looks to be explicit behaviour: memcheck/mc_leakcheck.c static MC_Chunk** find_active_chunks(Int* pn_chunks) { // Our goal is to construct a set of chunks that includes every // mempool chunk, and every malloc region that *doesn't* contain a // mempool chunk. ... // Then we loop over the mempool tables. For each chunk in each // pool, we set the entry in the Bool array corresponding to the // malloc chunk containing the mempool chunk. VG_(HT_ResetIter)(MC_(mempool_list)); while ( (mp = VG_(HT_Next)(MC_(mempool_list))) ) { VG_(HT_ResetIter)(mp->chunks); while ( (mc = VG_(HT_Next)(mp->chunks)) ) { // We'll need to record this chunk. n_chunks++; // Possibly invalidate the malloc holding the beginning of this chunk. m = find_chunk_for(mc->data, mallocs, n_mallocs); if (m != -1 && malloc_chunk_holds_a_pool_chunk[m] == False) { tl_assert(n_chunks > 0); n_chunks--; malloc_chunk_holds_a_pool_chunk[m] = True; } I think that means it explicitly ignores the entire malloced allocation whenever there's *any* mempool chunk in it, instead considering only the mempool chunks. So once aset allocats something in the init block, the context itself is ignored for leak checking purposes. But that wouldn't be the case if we didn't have the init block... I guess that makes sense, but would definitely be nice to have had documented... > Anyway, I don't have a problem with rearranging the responsibility > like this. It gives the individual allocators more freedom to do > odd stuff, at the cost of very minor duplication of valgrind calls. Yea, similar. > I agree we need more comments -- would you like me to have a go at > writing them? Gladly! > One thing I was stewing over last night is that a MemoryContextReset > will mess up any context identifier assigned with > MemoryContextCopyAndSetIdentifier. Valgrind should catch such problems. Not having the danger would be better, of course. We could also add an assertion at reset time that the identifier isn't in the current context. > The very simplest fix would be to allocate non-constant idents with > malloc; which'd require adding a flag to track whether context->ident > needs to be free()d. 
We have room for another bool near the top of > struct MemoryContextData (and at some point we could turn those > bool fields into a flags word). The only real cost here is one > more free() while destroying a labeled context, which is probably > negligible. Hm. A separate malloc()/free() could conceivably actually show up in profiles at some point. What if we instead used that flag to indicate that MemoryContextReset() needs to save the identifier? Then any cost would only be paid if the context is actually reset. > > There are a few leak warnings around guc.c that look like they might be > > real, not false positives, and thus a bit concerning. Looks like several > > guc check hooks don't bother to free the old *extra before allocating a > > new one. > > I'll take a look, but I'm pretty certain that guc.c, not the hooks, is > responsible for freeing those. Yea, I had misattributed the leak to the callbacks. One of the things I saw definitely is a leak: if call_string_check_hook() ereport(ERRORs) the guc_strdup() of newval->stringval is lost. There's another set of them that seems to be related to paralellism. But I've not hunted it down yet. I think it might be worth adding a VALGRIND_DO_CHANGED_LEAK_CHECK() at the end of a transaction or such? That way it'd not be quite as hard to pinpoint the so
Re: Permission failures with WAL files in 13~ on Windows
On Tue, Mar 16, 2021 at 11:40:12AM +0100, Magnus Hagander wrote: > If we can provide a new .EXE built with exactly the same flags as the > EDB downloads that they can just drop into a directory, I think it's a > lot easier to get that done. Yeah, multiple people have been complaining about that bug, so I have just produced two builds that people with those sensitive environments can use, and sent some private links to get the builds. Let's see how it goes from this point, but, FWIW, I have not been able to reproduce again my similar problem with the archive command :/ -- Michael signature.asc Description: PGP signature
Re: Calendar support in localization
On 3/17/21 3:48 PM, Tom Lane wrote: > Also, the SQL spec says in so many words > that the SQL-defined datetime types follow the Gregorian calendar. We already don't follow the SQL spec for timestamps (and I, for one, think we are better for it) so I don't think we should worry about that. -- Vik Fearing
Re: Getting better results from valgrind leak tracking
Andres Freund writes: > On 2021-03-17 18:07:36 -0400, Tom Lane wrote: >> Huh, interesting. I wonder why that makes the ident problem go away? > I spent a bit of time looking at valgrind, and it looks to be explicit > behaviour: > >// Our goal is to construct a set of chunks that includes every >// mempool chunk, and every malloc region that *doesn't* contain a >// mempool chunk. Ugh. > I guess that makes sense, but would definitely be nice to have had > documented... Indeed. So this started happening to us when we merged the aset header with its first allocation block. >>> There are a few leak warnings around guc.c that look like they might be >>> real, not false positives, and thus a bit concerning. Looks like several >>> guc check hooks don't bother to free the old *extra before allocating a >>> new one. >> I'll take a look, but I'm pretty certain that guc.c, not the hooks, is >> responsible for freeing those. I believe I've just tracked down the cause of that. Those errors are (as far as I've seen) only happening in parallel workers, and the reason is this gem in RestoreGUCState: /* See comment at can_skip_gucvar(). */ for (i = 0; i < num_guc_variables; i++) if (!can_skip_gucvar(guc_variables[i])) InitializeOneGUCOption(guc_variables[i]); where InitializeOneGUCOption zeroes out that GUC variable, causing it to lose track of any allocations it was responsible for. At minimum, somebody didn't understand the difference between "initialize" and "reset". But I suspect we could just nuke this loop altogether. I've not yet tried to grok the comment that purports to justify it, but I fail to see why it'd ever be useful to drop values inherited from the postmaster. regards, tom lane
Re: replication slot stats memory bug
On Thu, Mar 18, 2021 at 4:55 AM Andres Freund wrote: > > Hi, > > On 2021-03-17 16:04:47 -0700, Andres Freund wrote: > > I'll push the minimal fix of forcing the allocation to happen in > > pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot(). > > Done: > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5f79580ad69f6e696365bdc63bc265f45bd77211 > Thank you! -- With Regards, Amit Kapila.
Re: [HACKERS] Custom compression methods
On Wed, Mar 17, 2021 at 02:50:34PM -0700, Andres Freund wrote: > On 2021-03-17 16:01:58 -0400, Robert Haas wrote: > > > - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be > > > comparable to HIDE_TABLEAM? > > > > Andres, what do you mean by this exactly? It's exactly the same issue: > > without this, if you change the default compression method, every test > > that uses \d+ breaks. If you want to be able to run the whole test > > suite with either compression method and get the same results, you > > need this. Now, maybe you don't, because perhaps that doesn't seem so > > important with compression methods as with table AMs. Arguably, it's more important, since it affects every column in \d+, not just a "footer" line. > I think that latter part is why I wasn't sure such an option is > warranted. Given it's a builtin feature, I didn't really forsee a need > to be able to run all the tests with a different compression method. And > it looked a like it could just have been copied from the tableam logic, > without a clear need. But if it's useful, then ... This was one of my suggestions and contributions. I copied it from tableam specifically, not incidentally. https://www.postgresql.org/message-id/20210214184940.GL1793%40telsasoft.com -- Justin
Re: Permission failures with WAL files in 13~ on Windows
Hi, On 2021-03-16 16:20:37 +0900, Michael Paquier wrote: > Fujii-san has mentioned that on twitter, but one area that has changed > during the v13 cycle is aaa3aed, where the code recycling segments has > been switched from a pgrename() (with a retry loop) to a > CreateHardLinkA()+pgunlink() (with a retry loop for the second). One > theory that I got in mind here is the case where we create the hard > link, but fail to finish do the pgunlink() on the xlogtemp.N file, > though after some testing it did not seem to have any impact. A related question: What on earth is the point of using the unlink approach on any operating system. We use the durable_rename_excl() (and its predecessor, durable_link_or_rename(), and in turn its open coded predecessors) for things like recycling WAL files at check points. Now imagine that durable_rename_excl() fails to unlink the old file. We'll still have the old file, but there's a second link to a new WAL file, which will be used. No error will be thrown, because we don't check unlink()'s return code (but if we did, we'd still have similar issues). And then imagine that that happens again, during the next checkpoint, because the permission or whatever issue is not yet resolved. We now will have the same physical file in two location in the future WAL stream. Welcome impossible to debug issues. And all of this with the sole justification of "paranoidly trying to avoid overwriting an existing file (there shouldn't be one).". A few lines after we either unlinked the target filename, or used stat() to find an unused filename. Isn't the whole idea of durable_rename_excl() bad? There's not the same danger of using it when we start from a temp filename, e.g. during creation of new segments, or timeline files or whatnot. But I still don't see what the whole thing is supposed to protect us against realistically. Greetings, Andres Freund
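To spell out the failure mode, the link-based branch of durable_rename_excl() is essentially the following shape (heavily simplified; the real function also fsyncs and handles platforms without link()), and the danger is entirely in the second step:

    if (link(oldfile, newfile) < 0)
        elog(ERROR, "could not link file \"%s\" to \"%s\": %m",
             oldfile, newfile);

    /*
     * This is the problem: if the unlink fails (and its result is not
     * checked), oldfile and newfile are now two directory entries for the
     * same physical WAL segment, and a later recycle pass can link that same
     * file into yet another position in the future WAL stream.
     */
    unlink(oldfile);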
Re: replication slot stats memory bug
Hi, On 2021-03-17 19:36:46 -0400, Tom Lane wrote: > > But it seems like we just shouldn't allocate it dynamically at all? > > max_replication_slots doesn't change during postmaster lifetime, so it > > seems like it should just be allocated once? > > Meh. I don't see a need to wire in such an assumption here. It does make it easier for the shared memory stats patch, because if there's a fixed number + location, the relevant stats reporting doesn't need to go through a hashtable with the associated locking. I guess that may have colored my perception that it's better to just have a statically sized memory allocation for this. Noteworthy that SLRU stats are done in a fixed size allocation as well... Greetings, Andres Freund
Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW
On Tue, 16 Mar 2021 at 20:13, Bharath Rupireddy wrote: > On Tue, Mar 16, 2021 at 1:15 AM Tom Lane wrote: >> >> [ Sorry for not looking at this thread sooner ] >> >> Bharath Rupireddy writes: >> > Currently, $subject is not allowed. We do plan the mat view query >> > before every refresh. I propose to show the explain/explain analyze of >> > the select part of the mat view in case of Refresh Mat View(RMV). >> >> TBH, I think we should reject this. The problem with it is that it >> binds us to the assumption that REFRESH MATERIALIZED VIEW has an >> explainable plan. There are various people poking at ideas like >> incremental matview updates, which might rely on some implementation >> that doesn't exactly equate to a SQL query. Incremental updates are >> hard enough already; they'll be even harder if they also have to >> maintain compatibility with a pre-existing EXPLAIN behavior. >> >> I don't really see that this feature buys us anything you can't >> get by explaining the view's query, so I think we're better advised >> to keep our options open about how REFRESH might be implemented >> in future. > > That makes sense to me. Thanks for the comments. I'm fine to withdraw the > patch. > > I would like to see if the 0001 patch(attaching here) will be useful > at all. It just splits up the existing ExecRefreshMatView into a few > functions to make the code readable. +1. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Getting better results from valgrind leak tracking
Hi, On 2021-03-17 21:30:48 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2021-03-17 18:07:36 -0400, Tom Lane wrote: > >> Huh, interesting. I wonder why that makes the ident problem go away? > > > I spent a bit of time looking at valgrind, and it looks to be explicit > > behaviour: > > > >// Our goal is to construct a set of chunks that includes every > >// mempool chunk, and every malloc region that *doesn't* contain a > >// mempool chunk. > > Ugh. > > > I guess that makes sense, but would definitely be nice to have had > > documented... > > Indeed. So this started happening to us when we merged the aset > header with its first allocation block. Yea. I have the feeling that valgrinds error detection capability in our codebase used to be higher too. I wonder if that could be related somehow. Or maybe it's just a figment of my imagination. > I believe I've just tracked down the cause of that. Those errors > are (as far as I've seen) only happening in parallel workers, and > the reason is this gem in RestoreGUCState: > > /* See comment at can_skip_gucvar(). */ > for (i = 0; i < num_guc_variables; i++) > if (!can_skip_gucvar(guc_variables[i])) > InitializeOneGUCOption(guc_variables[i]); > > where InitializeOneGUCOption zeroes out that GUC variable, causing > it to lose track of any allocations it was responsible for. Ouch. At least it's a short lived issue rather than something permanently leaking memory on every SIGHUP... > At minimum, somebody didn't understand the difference between "initialize" > and "reset". But I suspect we could just nuke this loop altogether. > I've not yet tried to grok the comment that purports to justify it, > but I fail to see why it'd ever be useful to drop values inherited > from the postmaster. I can't really make sense of it either. I think it may be trying to restore the GUC state to what it would have been in postmaster, disregarding all the settings that were set as part of PostgresInit() etc? Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 17, 2021 at 7:21 AM Peter Geoghegan wrote: > > On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada wrote: > > > Note that I've merged multiple existing functions in vacuumlazy.c into > > > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > > > into a single function named vacuum_indexes_mark_unused() > > > I agree to create a function like vacuum_indexes_mark_unused() that > > makes a decision and does index and heap vacumming accordingly. But > > what is the point of removing both lazy_vacuum_all_indexes() and > > lazy_vacuum_heap()? I think we can simply have > > vacuum_indexes_mark_unused() call those functions. I'm concerned that > > if we add additional criteria to vacuum_indexes_mark_unused() in the > > future the function will become very large. > > I agree now. I became overly excited about advertising the fact that > these two functions are logically one thing. This is important, but it > isn't necessary to go as far as actually making everything into one > function. Adding some comments would also make that point clear, but > without making vacuumlazy.c even more spaghetti-like. I'll fix it. > > > > I wonder if we can add some kind of emergency anti-wraparound vacuum > > > logic to what I have here, for Postgres 14. > > > +1 > > > > I think we can set VACOPT_TERNARY_DISABLED to > > tab->at_params.index_cleanup in table_recheck_autovac() or increase > > the thresholds used to not skipping index vacuuming. > > I was worried about the "tupgone = true" special case causing problems > when we decide to do some index vacuuming and some heap > vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM. > But I now think that getting rid of "tupgone = true" gives us total > freedom to > choose what to do, including the freedom to start with index vacuuming > and then give up on it later -- even after we do some amount of > LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps > due to a low maintenance_work_mem setting). That isn't actually > special at all, because everything will be 100% decoupled. > > Whether or not it's a good idea to either not start index vacuuming or > to end it early (e.g. due to XID wraparound pressure) is a complicated > question, and the right approach will be debatable in each case/when > thinking about each issue. However, deciding on the best behavior to > address these problems should have nothing to do with implementation > details and everything to do with the costs and benefits to users. > Which now seems possible. > > A sophisticated version of the "XID wraparound pressure" > implementation could increase reliability without ever being > conservative, just by evaluating the situation regularly and being > prepared to change course when necessary -- until it is truly a matter > of shutting down new XID allocations/the server. It should be possible > to decide to end VACUUM early and advance relfrozenxid (provided we > have reached the point of finishing all required pruning and freezing, > of course). Highly agile behavior seems quite possible, even if it > takes a while to agree on a good design. Since I was thinking that always skipping index vacuuming on anti-wraparound autovacuum is legitimate, skipping index vacuuming only when we're really close to the point of going into read-only mode seems a bit conservative, but maybe a good start. 
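The PoC's check is conceptually along these lines (a sketch, not the attached patch; the variable names, the 1.5 multiplier, and where exactly the check is placed are as described above, with params/onerel standing in for whatever is in scope at that point):

    TransactionId skip_limit;

    /*
     * If relfrozenxid is already more than 1.5x autovacuum_freeze_max_age in
     * the past, skip index vacuuming so the anti-wraparound vacuum can
     * advance relfrozenxid as quickly as possible.
     */
    skip_limit = ReadNextTransactionId() -
        (TransactionId) (1.5 * autovacuum_freeze_max_age);
    if (!TransactionIdIsNormal(skip_limit))
        skip_limit = FirstNormalTransactionId;

    if (TransactionIdPrecedes(onerel->rd_rel->relfrozenxid, skip_limit))
        params->index_cleanup = VACOPT_TERNARY_DISABLED;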
I've attached a PoC patch to disable index vacuuming if the table's relfrozenxid is too older than autovacuum_freeze_max_age (older than 1.5x of autovacuum_freeze_max_age). > > > Aside from whether it's good or bad, strictly speaking, it could > > change the index AM API contract. The documentation of > > amvacuumcleanup() says: > > > > --- > > stats is whatever the last ambulkdelete call returned, or NULL if > > ambulkdelete was not called because no tuples needed to be deleted. > > --- > > > > With this change, we could pass NULL to amvacuumcleanup even though > > the index might have tuples needed to be deleted. > > I think that we should add a "Note" box to the documentation that > notes the difference here. Though FWIW my interpretation of the words > "no tuples needed to be deleted" was always "no tuples needed to be > deleted because vacuumlazy.c didn't call ambulkdelete()". After all, > VACUUM can add to tups_vacuumed through pruning inside > heap_prune_chain(). It is already possible (though only barely) to not > call ambulkdelete() for indexes (to only call amvacuumcleanup() during > cleanup) despite the fact that heap vacuuming did "delete tuples". Agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ poc_skip_index_cleanup_at_anti_wraparound.patch Description: Binary data
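For readers who want to see the shape of the threshold, here is a small, self-contained sketch of that kind of check (this is not the attached PoC patch; the names are invented, and a real implementation would also have to stay clear of the special XIDs below FirstNormalTransactionId):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* circular XID comparison, mirroring the backend's TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId xid1, TransactionId xid2)
{
	int32_t		diff = (int32_t) (xid1 - xid2);

	return diff < 0;
}

/*
 * Decide whether relfrozenxid is more than 1.5x autovacuum_freeze_max_age
 * behind the next XID to be assigned.  Integer arithmetic avoids float
 * rounding: 1.5x == x + x/2.
 */
static bool
skip_index_vacuum_for_wraparound(TransactionId relfrozenxid,
								 TransactionId next_xid,
								 uint32_t freeze_max_age)
{
	TransactionId limit = next_xid - (freeze_max_age + freeze_max_age / 2);

	return xid_precedes(relfrozenxid, limit);
}

Whether 1.5x is the right multiplier, and where such a check should live, is exactly what the rest of the thread goes on to debate.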
Re: Permission failures with WAL files in 13~ on Windows
Hi,

On 2021-03-18 09:55:46 +0900, Michael Paquier wrote:
> Let's see how it goes from this point, but, FWIW, I have not been able
> to reproduce again my similar problem with the archive command :/
> --

I suspect it might be easier to reproduce the issue with smaller WAL segments, a short checkpoint_timeout, and multiple jobs generating WAL and then sleeping for random amounts of time. Not sure if that's the sole ingredient, but consider what happens when there are processes that XLogWrite() some WAL and then sleep. Typically such a process' openLogFile will still point to the WAL segment. And they may still do that when the next checkpoint finishes and we recycle the WAL file.

I wonder if we actually fail to unlink() the file in durable_link_or_rename(), and then end up recycling the same old file into multiple "future" positions in the WAL stream.

There are also these interesting notes at
https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-createhardlinka

1)
> The security descriptor belongs to the file to which a hard link
> points. The link itself is only a directory entry, and does not have a
> security descriptor. Therefore, when you change the security
> descriptor of a hard link, you change the security descriptor of the
> underlying file, and all hard links that point to the file allow the
> newly specified access. You cannot give a file different security
> descriptors on a per-hard-link basis.

2)
> Flags, attributes, access, and sharing that are specified in
> CreateFile operate on a per-file basis. That is, if you open a file
> that does not allow sharing, another application cannot share the file
> by creating a new hard link to the file.

3)
> The maximum number of hard links that can be created with this
> function is 1023 per file. If more than 1023 links are created for a
> file, an error results.

1) and 2) seem problematic for restore_command use. I wonder if there's a chance that some of the reports ended up hitting 3), and that windows doesn't handle that well.

If you manage to reproduce, could you check what the link count of all the segments is? Apparently sysinternal's findlinks can do that.

Or perhaps even better, add an error check that the number of links of WAL segments is 1 in a bunch of places (recycling, opening them, closing them, maybe?).

Plus error reporting for unlink failures, of course.

Greetings,

Andres Freund
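As a concrete illustration of the link-count check suggested above, here is a small, self-contained sketch using the Win32 API (GetFileInformationByHandle() reports nNumberOfLinks). This is only a sketch of the idea, not code from the PostgreSQL tree; a backend version would report failures via ereport() and would likely reuse an already-open handle rather than opening the file again.

#include <windows.h>

/*
 * Return the number of hard links pointing at "path", or -1 on error.
 */
static int
get_hard_link_count(const char *path)
{
	HANDLE		h;
	BY_HANDLE_FILE_INFORMATION info;
	int			nlinks = -1;

	h = CreateFileA(path, GENERIC_READ,
					FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
					NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
	if (h == INVALID_HANDLE_VALUE)
		return -1;

	if (GetFileInformationByHandle(h, &info))
		nlinks = (int) info.nNumberOfLinks;

	CloseHandle(h);
	return nlinks;
}

The idea would then be to assert or log whenever a WAL segment's count is anything other than 1 at recycle, open, or close time, as suggested above.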
Re: Getting better results from valgrind leak tracking
Andres Freund writes: > On 2021-03-17 21:30:48 -0400, Tom Lane wrote: >> I believe I've just tracked down the cause of that. Those errors >> are (as far as I've seen) only happening in parallel workers, and >> the reason is this gem in RestoreGUCState: ... > Ouch. At least it's a short lived issue rather than something permanently > leaking memory on every SIGHUP... Yeah. This leak is really insignificant in practice, although I'm suspicious that randomly init'ing GUCs like this might have semantic issues that we've not detected yet. >> I've not yet tried to grok the comment that purports to justify it, >> but I fail to see why it'd ever be useful to drop values inherited >> from the postmaster. > I can't really make sense of it either. I think it may be trying to > restore the GUC state to what it would have been in postmaster, > disregarding all the settings that were set as part of PostgresInit() > etc? At least in a non-EXEC_BACKEND build, the most reliable way to reproduce the postmaster's settings is to do nothing whatsoever. And I think the same is true for EXEC_BACKEND, really, because the guc.c subsystem is responsible for restoring what would have been the inherited-via-fork settings. So I'm really not sure what this is on about, and I'm too tired to try to figure it out tonight. In other news, I found that there's a genuine leak in RelationBuildLocalRelation: RelationInitTableAccessMethod was added in a spot where CurrentMemoryContext is CacheMemoryContext, which is bad because it does a syscache lookup that can lead to a table access which can leak some memory. Seems easy to fix though. The plpgsql situation looks like a mess. As a short-term answer, I'm inclined to recommend adding an exclusion that will ignore anything allocated within plpgsql_compile(). Doing better will require a fair amount of rewriting. (Although I suppose we could also consider adding an on_proc_exit callback that runs through and frees all the function cache entries.) regards, tom lane
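For anyone who wants to try the short-term exclusion suggested above, a Memcheck suppression along these lines (passed via --suppressions=FILE; syntax as of valgrind 3.9 or later) should hide any leak whose allocation stack passes through plpgsql_compile(); adjust match-leak-kinds to taste, since which leak kinds to ignore is a judgment call:

{
   ignore_plpgsql_compile_allocations
   Memcheck:Leak
   match-leak-kinds: all
   ...
   fun:plpgsql_compile
   ...
}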
Re: Parallel INSERT (INTO ... SELECT ...)
On Fri, Mar 12, 2021 at 04:05:09PM +0900, Amit Langote wrote:
> On Fri, Mar 12, 2021 at 6:10 AM Justin Pryzby wrote:
> > Note also this CF entry
> > https://commitfest.postgresql.org/32/2987/
> > | Allow setting parallel_workers on partitioned tables

I'm rebasing that other patch on top of master (with this patch), and I noticed that it adds this bit, and this patch should have done something similar:

--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1340,4 +1340,5 @@ WITH ( MODULUS numeric_literal, REM
 will use the table's parameter value.
-Specifying these parameters for partitioned tables is not supported,
-but you may specify them for individual leaf partitions.
+These parameters, with the exception of parallel_workers,
+are not supported on partitioned tables, but you may specify them for
+individual leaf partitions.

--
Justin
RE: libpq debug log
I'll look at the comments from Alvaro-san and Horiguchi-san. Here are my review comments:

(23)
+	/* Trace message only when there is first 1 byte */
+	if (conn->Pfdebug && conn->outCount < conn->outMsgStart)
+	{
+		if (conn->outCount < conn->outMsgStart)
+			pqTraceOutputMessage(conn, conn->outBuffer + conn->outCount, true);
+		else
+			pqTraceOutputNoTypeByteMessage(conn,
+										   conn->outBuffer + conn->outMsgStart);
+	}

The inner else path doesn't seem to be reached because both the outer and inner if contain the same condition. I think you want to remove the second condition from the outer if.

(24) pqTraceOutputNoTypeByteMessage
+		case 16:/* CancelRequest */
+			fprintf(conn->Pfdebug, "%s\t>\tCancelRequest\t%d", timestr, length);
+			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
+			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
+			break;

Another Int32 field needs to be output, as follows:

--
Int32(80877102)
The cancel request code. The value is chosen to contain 1234 in the most significant 16 bits, and 5678 in the least significant 16 bits. (To avoid confusion, this code must not be the same as any protocol version number.)

Int32
The process ID of the target backend.

Int32
The secret key for the target backend.
--

(25)
+		case 8 :/* GSSENRequest or SSLRequest */

GSSENRequest -> GSSENCRequest

(26)
+static void
+pqTraceOutputByte1(const char *data, int *cursor, FILE *pfdebug)
+{
+	const char *v = data + *cursor;
+	/*

+static void
+pqTraceOutputf(const char *message, int end, FILE *f)
+{
+	int			cursor = 0;
+	pqTraceOutputString(message, &cursor, end, f);
+}

Put an empty line to separate the declaration part and the execution part.

(27)
+	const char *message_type = "UnkownMessage";
+
+	id = message[LogCursor++];
+
+	memcpy(&length, message + LogCursor , 4);
+	length = (int) pg_ntoh32(length);
+	LogCursor += 4;
+	LogEnd = length - 4;
+	/* Get a protocol type from first byte identifier */
+	if (toServer &&
+		id < lengthof(protocol_message_type_f) &&
+		protocol_message_type_f[(unsigned char)id] != NULL)
+		message_type = protocol_message_type_f[(unsigned char)id];
+	else if (!toServer &&
+			 id < lengthof(protocol_message_type_b) &&
+			 protocol_message_type_b[(unsigned char)id] != NULL)
+		message_type = protocol_message_type_b[(unsigned char)id];
+	else
+	{
+		fprintf(conn->Pfdebug, "Unknown message: %02x\n", id);
+		return;
+	}
+

The initial value "UnkownMessage" is not used. So, you can initialize message_type with NULL and do something like:

+	if (...)
+		...
+	else if (...)
+		...
+
+	if (message_type == NULL)
+	{
+		fprintf(conn->Pfdebug, "Unknown message: %02x\n", id);
+		return;
+	}

Plus, I think this should be done before looking at the message length.

(28)
pqTraceOutputBinary() is only used in pqTraceOutputNchar(). Do they have to be separated?

Regards
Takayuki Tsunakawa
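To spell out what comment (24) is asking for, the corrected case would presumably look something like this (a sketch only, reusing the helpers already visible in the quoted hunk; the actual patch may structure it differently):

		case 16:			/* CancelRequest */
			fprintf(conn->Pfdebug, "%s\t>\tCancelRequest\t%d", timestr, length);
			/* Int32: cancel request code (80877102) */
			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
			/* Int32: process ID of the target backend */
			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
			/* Int32: secret key for the target backend */
			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
			break;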
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 4:11 PM Andres Freund wrote:
> I kinda wonder whether this case should just be handled by just gotoing
> back to the start of the blkno loop, and redoing the pruning. The only
> thing that makes that a bit more complicated is that we've already
> incremented vacrelstats->{scanned_pages,vacrelstats->tupcount_pages}.
>
> We really should put the per-page work (i.e. the blkno loop body) of
> lazy_scan_heap() into a separate function, same with the
> too-many-dead-tuples branch.

Attached patch series splits everything up. There is now a large patch that removes the tupgone special case, and a second patch that actually adds code that dynamically decides not to do index vacuuming in cases where (for whatever reason) it doesn't seem useful.

Here are some key points about the first patch:

* Eliminates the "tupgone = true" special case by putting pruning, the HTSV() call, as well as tuple freezing into a new, dedicated function -- the function is prepared to restart pruning in those rare cases where the vacuumlazy.c HTSV() call indicates that a tuple is dead. Restarting pruning prunes the page again, rendering the DEAD tuple with storage an LP_DEAD line pointer stub. The restart thing is based on Andres' suggestion.

This patch enables incremental VACUUM (the second patch, and likely other variations) by allowing us to make a uniform assumption that it is never strictly necessary to reach lazy_vacuum_all_indexes() or lazy_vacuum_heap(). It is now possible to "end VACUUM early" while still advancing relfrozenxid. Provided we've finished the first scan of the heap, that should be safe.

* In principle we could visit and revisit the question of whether or not vacuuming should continue or end early many times, as new information comes to light. For example, maybe Masahiko's patch from today could be taught to monitor how old relfrozenxid is again and again, before finally giving up early when the risk of wraparound becomes very severe -- but not before then.

* I've added three structs that replace a blizzard of local variables used in lazy_scan_heap() with just three (one variable for each of the three structs). I've also moved several chunks of logic to other new functions (in addition to one that does pruning and freezing). I think that I have the data structures roughly right here -- but I would appreciate some feedback on that.

Does this seem like the right direction?

--
Peter Geoghegan

v3-0002-Skip-index-vacuuming-dynamically.patch
Description: Binary data

v3-0001-Remove-tupgone-special-case-from-vacuumlazy.c.patch
Description: Binary data
RE: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
Dear Ikeda-san,

I confirmed the new patch and found no problems. Thanks.
(I'm not a native English speaker, so I cannot check your comments properly, sorry.)

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Parallel INSERT (INTO ... SELECT ...)
From 55c7326e56f9b49710a564e91c46212a17f12b24 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Wed, 17 Mar 2021 21:47:27 -0500
Subject: [PATCH] Fix for parallel_insert_enabled

---
 doc/src/sgml/ref/create_table.sgml | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index ff1b642722..d5d356f2de 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1338,8 +1338,10 @@ WITH ( MODULUS numeric_literal, REM
 If a table parameter value is set and the
 equivalent toast. parameter is not, the TOAST table
 will use the table's parameter value.
-Specifying these parameters for partitioned tables is not supported,
-but you may specify them for individual leaf partitions.
+These parameters, with the exception of
+parallel_insert_enabled,
+are not supported on partitioned tables, but may be specified for
+individual leaf partitions.
--
2.17.0
Re: Permission failures with WAL files in 13~ on Windows
On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens when there are processes that
> XLogWrite() some WAL and then sleep. Typically such a process'
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.

Yep. That's basically the kind of scenario I have been testing to stress the recycling/removing, with pgbench putting some load into the server. This has worked for me. Once. But I have little idea why it gets easier to reproduce in the environments of others, so there may be an OS-version dependency in the equation here.

> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.

You actually mean durable_rename_excl() as of 13~, right? Yeah, this matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complaint about renaming.

> 1) and 2) seem problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that windows
> doesn't handle that well.

Yeap. I was thinking about 3) being the actual problem while going through those docs two days ago.

> If you manage to reproduce, could you check what the link count of
> all the segments is? Apparently sysinternal's findlinks can do that.
>
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
>
> Plus error reporting for unlink failures, of course.

Yep, that's actually something I wrote for my own setups, with log_checkpoints enabled to catch all concurrent checkpoint activity and some LOGs. Still no luck unfortunately :(

--
Michael

signature.asc
Description: PGP signature
Re: Getting better results from valgrind leak tracking
Hi,

On 2021-03-17 00:01:55 -0400, Tom Lane wrote:
> As for the particular point about ParallelBlockTableScanWorkerData,
> I agree with your question to David about why that's in TableScanDesc
> not HeapScanDesc, but I can't get excited about it not being freed in
> heap_endscan. That's mainly because I do not believe that anything as
> complex as a heap or indexscan should be counted on to be zero-leakage.
> The right answer is to not do such operations in long-lived contexts.
> So if we're running such a thing while switched into CacheContext,
> *that* is the bug, not that heap_endscan didn't free this particular
> allocation.

I agree that it's a bad idea to do scans in non-transient contexts. It does however seem like there are a number of places that do...

I added the following hacky definition of "permanent" contexts:

/*
 * NB: Only for assertion use.
 *
 * TopMemoryContext itself obviously is permanent. Treat CacheMemoryContext
 * and all its children as permanent too.
 *
 * XXX: Might be worth adding this as an explicit flag on the context?
 */
bool
MemoryContextIsPermanent(MemoryContext c)
{
	if (c == TopMemoryContext)
		return true;

	while (c)
	{
		if (c == CacheMemoryContext)
			return true;
		c = c->parent;
	}

	return false;
}

and checked that the CurrentMemoryContext is not permanent in SearchCatCacheInternal() and systable_beginscan(). Hit a number of times.

The most glaring case is the RelationInitTableAccessMethod() call in RelationBuildLocalRelation(). Seems like the best fix is to just move the MemoryContextSwitchTo() to just before the RelationInitTableAccessMethod(). Although I wonder if we shouldn't go further, and move it to much earlier, somewhere after the rd_rel allocation.

There's plenty other hits, but I think I should get back to working on making the shared memory stats patch committable. I really wouldn't want it to slip yet another year.

But I think it might make sense to add a flag indicating contexts that shouldn't be used for non-transient data. Seems like we fairly regularly have "bugs" around this?

Greetings,

Andres Freund
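Presumably the checks mentioned above boil down to a one-line assertion near the top of SearchCatCacheInternal() and systable_beginscan(), something like the following (the exact placement is an assumption):

	Assert(!MemoryContextIsPermanent(CurrentMemoryContext));

With assertions enabled, any code path that reaches a syscache lookup or catalog scan while switched into TopMemoryContext, CacheMemoryContext, or one of its children then trips immediately, which is how cases like the RelationInitTableAccessMethod() call in RelationBuildLocalRelation() show up.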
Re: Getting better results from valgrind leak tracking
On 2021-03-17 22:33:59 -0400, Tom Lane wrote:
> >> I've not yet tried to grok the comment that purports to justify it,
> >> but I fail to see why it'd ever be useful to drop values inherited
> >> from the postmaster.
>
> > I can't really make sense of it either. I think it may be trying to
> > restore the GUC state to what it would have been in postmaster,
> > disregarding all the settings that were set as part of PostgresInit()
> > etc?
>
> At least in a non-EXEC_BACKEND build, the most reliable way to reproduce
> the postmaster's settings is to do nothing whatsoever. And I think the
> same is true for EXEC_BACKEND, really, because the guc.c subsystem is
> responsible for restoring what would have been the inherited-via-fork
> settings. So I'm really not sure what this is on about, and I'm too
> tired to try to figure it out tonight.

The restore thing runs after we've already set and initialized GUCs, including things like user/database default GUCs. I see things like

==2251779== 4,560 bytes in 1 blocks are definitely lost in loss record 383 of 406
==2251779==    at 0x483877F: malloc (vg_replace_malloc.c:307)
==2251779==    by 0x714A45: ConvertTimeZoneAbbrevs (datetime.c:4556)
==2251779==    by 0x88DE95: load_tzoffsets (tzparser.c:465)
==2251779==    by 0x88511D: check_timezone_abbreviations (guc.c:11565)
==2251779==    by 0x884633: call_string_check_hook (guc.c:11232)
==2251779==    by 0x87CB57: parse_and_validate_value (guc.c:7012)
==2251779==    by 0x87DD5F: set_config_option (guc.c:7630)
==2251779==    by 0x88397F: ProcessGUCArray (guc.c:10784)
==2251779==    by 0x32BCCF: ApplySetting (pg_db_role_setting.c:256)
==2251779==    by 0x874CA2: process_settings (postinit.c:1163)
==2251779==    by 0x874A0B: InitPostgres (postinit.c:1048)
==2251779==    by 0x60129A: BackgroundWorkerInitializeConnectionByOid (postmaster.c:5681)
==2251779==    by 0x2BB283: ParallelWorkerMain (parallel.c:1374)
==2251779==    by 0x5EBA6A: StartBackgroundWorker (bgworker.c:864)
==2251779==    by 0x6014FE: do_start_bgworker (postmaster.c:5802)
==2251779==    by 0x6018D2: maybe_start_bgworkers (postmaster.c:6027)
==2251779==    by 0x600811: sigusr1_handler (postmaster.c:5190)
==2251779==    by 0x4DD513F: ??? (in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
==2251779==    by 0x556E865: select (select.c:41)
==2251779==    by 0x5FC0CB: ServerLoop (postmaster.c:1700)
==2251779==    by 0x5FBA68: PostmasterMain (postmaster.c:1408)
==2251779==    by 0x4F8BFD: main (main.c:209)

The BackgroundWorkerInitializeConnectionByOid() in that trace is before the RestoreGUCState() later in ParallelWorkerMain(). But that doesn't really explain the approach.

> In other news, I found that there's a genuine leak in
> RelationBuildLocalRelation: RelationInitTableAccessMethod
> was added in a spot where CurrentMemoryContext is CacheMemoryContext,
> which is bad because it does a syscache lookup that can lead to
> a table access which can leak some memory. Seems easy to fix though.

Yea, that's the one I was talking about re ParallelBlockTableScanWorkerData not being freed. While I think we should also add that pfree, I think you're right, and we shouldn't do RelationInitTableAccessMethod() while in CacheMemoryContext.

> The plpgsql situation looks like a mess. As a short-term answer,
> I'm inclined to recommend adding an exclusion that will ignore anything
> allocated within plpgsql_compile(). Doing better will require a fair
> amount of rewriting. (Although I suppose we could also consider adding
> an on_proc_exit callback that runs through and frees all the function
> cache entries.)
The error variant of this one seems like it might actually be a practically relevant leak? As well as increasing memory usage for compiled plpgsql functions unnecessarily, of course. The latter would be good to fix, but the former seems like it might be a practical issue for poolers and the like? So I think we should do at least the reparenting thing to address that? Greetings, Andres Freund
Re: Getting better results from valgrind leak tracking
Andres Freund writes: > The most glaring case is the RelationInitTableAccessMethod() call in > RelationBuildLocalRelation(). Seems like the best fix is to just move > the MemoryContextSwitchTo() to just before the > RelationInitTableAccessMethod(). Although I wonder if we shouldn't go > further, and move it to much earlier, somewhere after the rd_rel > allocation. Yeah, same thing I did locally. Not sure if it's worth working harder. > There's plenty other hits, but I think I should get back to working on > making the shared memory stats patch committable. I really wouldn't want > it to slip yet another year. +1, so far there's little indication that we're finding any serious leaks here. Might be best to set it all aside till there's more time. regards, tom lane
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 17, 2021 at 7:16 PM Masahiko Sawada wrote:
> Since I was thinking that always skipping index vacuuming on
> anti-wraparound autovacuum is legitimate, skipping index vacuuming
> only when we're really close to the point of going into read-only mode
> seems a bit conservative, but maybe a good start. I've attached a PoC
> patch to disable index vacuuming if the table's relfrozenxid is much
> older than autovacuum_freeze_max_age (older than 1.5x of
> autovacuum_freeze_max_age).

Most anti-wraparound VACUUMs are really not emergencies, though. So treating them as special simply because they're anti-wraparound vacuums doesn't seem like the right thing to do.

I think that we should dynamically decide to do this when (antiwraparound) VACUUM has already been running for some time. We need to delay the decision until it is almost certainly true that we really have an emergency. Can you take what you have here, and make the decision dynamic? Delay it until we're done with the first heap scan? This will require rebasing on top of the patch I posted. And then adding a third patch, a little like the second patch -- but not too much like it.

In the second/SKIP_VACUUM_PAGES_RATIO patch I posted today, the function two_pass_strategy() (my new name for the main entry point for calling lazy_vacuum_all_indexes() and lazy_vacuum_heap()) is only willing to perform the "skip index vacuuming" optimization when the call to two_pass_strategy() is the first call and the last call for that entire VACUUM (plus we test the number of heap blocks with LP_DEAD items using SKIP_VACUUM_PAGES_RATIO, of course). It works this way purely because I don't think that we should be aggressive when we've already run out of maintenance_work_mem. That's a bad time to apply a performance optimization.

But what you're talking about now isn't a performance optimization (the mechanism is similar or the same, but the underlying reasons are totally different) -- it's a safety/availability thing. I don't think that you need to be concerned about running out of maintenance_work_mem in two_pass_strategy() when applying logic that is concerned about keeping the database online by avoiding XID wraparound. You just need to have high confidence that it is a true emergency. I think that we can be ~99% sure that we're in a real emergency by using dynamic information about how old relfrozenxid is *now*, and by rechecking a few times during VACUUM. Probably by rechecking every time we call two_pass_strategy().

I now believe that there is no fundamental correctness issue with teaching two_pass_strategy() to skip index vacuuming when we're low on memory -- it is 100% a matter of costs and benefits. The core skip-index-vacuuming mechanism is very flexible. If we can be sure that it's a real emergency, I think that we can justify behaving very aggressively (letting indexes get bloated is after all very aggressive). We just need to be 99%+ sure that continuing with vacuuming will be worse than ending vacuuming. Which seems possible by making the decision dynamic (and revisiting it at least a few times during a very long VACUUM, in case things change).

--
Peter Geoghegan
Re: Parallel INSERT (INTO ... SELECT ...)
On Thu, Mar 18, 2021 at 8:30 AM Justin Pryzby wrote:
>
> diff --git a/doc/src/sgml/ref/create_table.sgml
> b/doc/src/sgml/ref/create_table.sgml
> index ff1b642722..d5d356f2de 100644
> --- a/doc/src/sgml/ref/create_table.sgml
> +++ b/doc/src/sgml/ref/create_table.sgml
> @@ -1338,8 +1338,10 @@ WITH ( MODULUS numeric_literal, REM
>  If a table parameter value is set and the
>  equivalent toast. parameter is not, the TOAST table
>  will use the table's parameter value.
> -Specifying these parameters for partitioned tables is not supported,
> -but you may specify them for individual leaf partitions.
> +These parameters, with the exception of
> +parallel_insert_enabled,
> +are not supported on partitioned tables, but may be specified for
> +individual leaf partitions.
>

Your patch looks good to me. While checking this, I noticed a typo in the previous patch:

- planner parameter parallel_workers.
+ planner parameter parallel_workers and
+ parallel_insert_enabled.

Here, it should be /planner parameter/planner parameters.

--
With Regards,
Amit Kapila.
RE: Parallel INSERT (INTO ... SELECT ...)
> > If a table parameter value is set and the
> > equivalent toast. parameter is not, the TOAST table
> > will use the table's parameter value.
> > -Specifying these parameters for partitioned tables is not supported,
> > -but you may specify them for individual leaf partitions.
> > +These parameters, with the exception of
> > +parallel_insert_enabled,
> > +are not supported on partitioned tables, but may be specified for
> > +individual leaf partitions.
>
> Your patch looks good to me. While checking this, I noticed a typo in the
> previous patch:
> - planner parameter parallel_workers.
> + planner parameter parallel_workers and
> + parallel_insert_enabled.
>
> Here, it should be /planner parameter/planner parameters.

Thanks Amit and Justin for pointing this out! The changes look good to me.

Best regards,
houzj