Re: Wait profiling

2020-07-11 Thread Julien Rouhaud
On Fri, Jul 10, 2020 at 10:37 PM Alvaro Herrera
 wrote:
>
> On 2020-Jul-10, Daniel Wood wrote:
>
> > After nearly 5 years does something like the following yet exist?
> > https://www.postgresql.org/message-id/559d4729.9080...@postgrespro.ru
>
> Yes, we have pg_stat_activity.wait_events which implement pretty much
> what Ildus describes there.
>
> > 1) An option to "explain" to produce a wait events profile.
> > postgres=# explain (analyze, waitprofile) update pgbench_accounts set 
> > bid=bid+1 where aid < 200;
> > ...
> > Execution time: 23111.231 ms
>
> There's an out-of-core extension, search for pg_wait_sampling.  I
> haven't tested it yet ...

I use it, and I know multiple people that are also using it (or about
to, it's currently being packaged) in production.  It's working quite
well and is compatible with pg_stat_statements' queryid.  You can see
some examples of dashboards that can be built on top of this extension
at 
https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_wait_sampling.html.
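For instance, a per-query wait profile can be pulled with something along
these lines (a rough sketch; it assumes pg_wait_sampling's
pg_wait_sampling_profile view with its queryid column, plus
pg_stat_statements):

SELECT s.query, p.event_type, p.event, sum(p.count) AS samples
  FROM pg_wait_sampling_profile p
  JOIN pg_stat_statements s USING (queryid)
 GROUP BY 1, 2, 3
 ORDER BY samples DESC
 LIMIT 10;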




Re: output columns of \dAo and \dAp

2020-07-11 Thread Alexander Korotkov
On Fri, Jul 10, 2020 at 2:24 AM Alexander Korotkov  wrote:
> On Thu, Jul 9, 2020 at 10:03 PM Jonathan S. Katz  wrote:
> > From the RMT perspective, if there is an agreed upon approach (which it
> > sounds like from the above) can someone please commit to working on
> > resolving this open item?
>
> I hardly can extract an approach from this thread, because for me the
> whole issue is about details :)
>
> But I think we can come to an agreement shortly.  And yes, I commit to
> resolve this.

The proposed patch is attached.  This patch fixes two points:
 * Adds strategy number and purpose to the output of \dAo
 * Renames the "Left/right arg type" columns of \dAp to "Registered left/right type"

I'm not yet convinced we should change the sort key for \dAo.

Any thoughts?

--
Regards,
Alexander Korotkov


0001-Improvements-to-psql-dAo-and-dAp-commands.patch
Description: Binary data


Re: WIP: BRIN multi-range indexes

2020-07-11 Thread Tomas Vondra

On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:

Tomas Vondra  schrieb am Fr., 10. Juli 2020,
14:09:


On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>On Fri, 3 Jul 2020 at 09:58, Tomas Vondra 
wrote:
>>
>> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> > wrote:
>> ...
>> >> >
>> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >inclined to rush on these three as well.  But you're willing to
commit
>> >> >them, you can count round of review on me.
>> >> >
>> >>
>> >> I have no intention to get 0001-0003 committed. I think those changes
>> >> are beneficial on their own, but the primary reason was to support
the
>> >> new opclasses (which require those changes). And those parts are not
>> >> going to make it into v13 ...
>> >
>> >OK, no problem.
>> >Let's do this for v14.
>> >
>>
>> Hi Alexander,
>>
>> Are you still interested in reviewing those patches? I'll take a look at
>> 0001-0003 to check that your previous feedback was addressed. Do you
>> have any comments about 0004 / 0005, which I think are the more
>> interesting parts of this series?
>>
>>
>> Attached is a rebased version - I realized I forgot to include 0005 in
>> the last update, for some reason.
>>
>
>I've done a quick test with this patch set. I wonder if we can improve
>brin_page_items() SQL function in pageinspect as well. Currently,
>brin_page_items() is hard-coded to support only normal brin indexes.
>When we pass brin-bloom or brin-multi-range to that function the
>binary values are shown in 'value' column but it seems not helpful for
>users. For instance, here is an output of brin_page_items() with a
>brin-multi-range index:
>
>postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>2), 'mul');
>-[ RECORD 1 ]-----------------------------------------------------------
>
>itemoffset  | 1
>blknum  | 0
>attnum  | 1
>allnulls| f
>hasnulls| f
>placeholder | f
>value   | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}
>

Hmm. I'm not sure we can do much better, without making the function
much more complicated. I mean, even with regular BRIN indexes we don't
really know if the value is plain min/max, right?


You can be sure with the next node. A value reported as "in" can be a false
positive; a value reported as "out" is certain. You can detect the change
between in and out.



I'm sorry, I don't understand what you're suggesting. How is any of this
related to false positive rate, etc?

The problem here is that while plain BRIN opclasses have a fairly simple
summary that can be stored using a fixed number of simple data types
(e.g. minmax will store two values with the same data type as the
indexed column):

result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
 sizeof(MinmaxOpaque));
result->oi_nstored = 2;
result->oi_opaque = (MinmaxOpaque *)
MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
result->oi_typcache[0] = result->oi_typcache[1] =
lookup_type_cache(typoid, 0);

The opclasses introduced here have a somewhat more complex summary, stored
as a single bytea value - which is what gets printed by brin_page_items.

To print something easier to read (for humans) we'd either have to teach
brin_page_items about the different opclasses (multi-range, bloom) and
how to parse the summary bytea, or we'd have to extend the opclasses
with a function formatting the summary. Or rework how the summary is
stored, but that seems like the worst option.
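For contrast, with a plain minmax opclass the summary already prints
readably - a quick sketch using pageinspect (page 2 being the first regular
BRIN page):

CREATE EXTENSION IF NOT EXISTS pageinspect;
CREATE TABLE brin_demo (a int);
INSERT INTO brin_demo SELECT generate_series(1, 100000);
CREATE INDEX brin_demo_idx ON brin_demo USING brin (a);
SELECT itemoffset, blknum, attnum, value
  FROM brin_page_items(get_raw_page('brin_demo_idx', 2), 'brin_demo_idx');
-- here "value" is a readable {low .. high} pair per range, whereas the
-- bloom/multi-range opclasses dump their whole summary as one bytea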


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [PATCH] pg_dump: Add example and link for --encoding option

2020-07-11 Thread Peter Eisentraut

On 2020-06-05 14:45, 이동욱 wrote:
I've modified my previous patch because it linked the wrong document, so I
fixed that. I also added a highlight to the encoding list for readability.


What do you think about this change?


The wording in your patch needed a bit of editing but adding a link to 
the supported encodings seems sensible.  I have committed it with a new 
wording.




Thanks,
Dong Wook

On Thu, Jun 4, 2020 at 10:20 PM 이동욱 wrote:


To let users know what kind of character set
can be used add examples and a link to --encoding option.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Binary support for pgoutput plugin

2020-07-11 Thread Dave Cramer
On Fri, 10 Jul 2020 at 14:21, Tom Lane  wrote:

> Daniel Gustafsson  writes:
> > Thanks for the update!  Do note that my patch included a new file which
> is
> > missing from this patchset:
> >   src/test/subscription/t/014_binary.pl
> > This is, IMO, the most interesting test of this feature so it would be
> good to
> > be included.  It's a basic test and can no doubt be extended to be even
> more
> > relevant, but it's a start.
>
> I was about to complain that the latest patchset includes no meaningful
> test cases, but I assume that this missing file contains those.
>
> >> I did not do the macro for updated, inserted, deleted, will give that a
> go tomorrow.
>
> > This might not be a blocker, but personally I think it would make the
> > code more readable. Anyone else have an opinion on this?
>
> +1 for using macros.
>

Got it, will add.

>
> > Reading through the new patch, and running the tests, I'm marking this
> as Ready
> > for Committer.  It does need some cosmetic TLC, quite possibly just from
> > pg_indent but I didn't validate if it will take care of everything, and
> comment
> > touchups (there is still a TODO comment around wording that needs to be
> > resolved).  However, I think it's in good enough shape for consideration
> at
> > this point.
>
> I took a quick look through the patch and had some concerns:
>
> * Please strip out the PG_VERSION_NUM and USE_INTEGER_DATETIMES checks.
> Those are quite dead so far as a patch for HEAD is concerned --- in fact,
> since USE_INTEGER_DATETIMES hasn't even been defined since v10 or so,
> the patch is actively doing the wrong thing there.  Not that it matters.
> This code will never appear in any branch where float timestamps could
> be a thing.
>
> * I doubt that the checks on USE_FLOAT4/8_BYVAL, sizeof(int), endiannness,
> etc, make any sense either.  Those surely do not affect the on-the-wire
> representation of values --- or if they do, we've blown it somewhere else.
> I'd just take out all those checks and assume that the binary
> representation is sufficiently portable.  (If it's not, it's more or less
> the user's problem, just as in binary COPY.)
>

So is there any point in having them as options then ?

>
> * Please also remove debugging hacks such as enabling WAL_DEBUG.
>
> * It'd likely be wise for the documentation to point out that binary
> mode only works if all types to be transferred have send/receive
> functions.
>

will do
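One way to sanity-check that - a sketch that simply lists the types lacking
binary I/O functions, which is exactly what would trip up binary mode:

SELECT n.nspname, t.typname
  FROM pg_catalog.pg_type t
  JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
 WHERE (t.typsend::oid = 0 OR t.typreceive::oid = 0)
   AND t.typtype NOT IN ('c', 'p');  -- ignore composites and pseudo-types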

>
>
> BTW, while it's not the job of this patch to fix it, I find it quite
> distressing that we're apparently repeating the lookups of the type
> I/O functions for every row transferred.
>
> I'll set this back to WoA, but I concur with Daniel's opinion that
> it doesn't seem that far from committability.
>

Thanks for looking at this

Dave Cramer


INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-11 Thread Dilip Kumar
I have just noticed that parallelism is off even for the select
part of the query mentioned in the $subject.  I see that the only reason it
is not getting parallelized is that we block parallelism if the query
type is not SELECT.  I don't see any reason for not selecting
parallelism for this query.  I have quickly hacked the code to enable
parallelism for this query, and I can see there is a significant
improvement if we enable parallelism in this case.  For the
experiment, I have just relaxed a couple of checks; maybe if we think
it's good to enable parallelism for this case we can try to
put better checks which are specific to this query.

No parallel:
postgres[36635]=# explain analyze insert into t2 select * from t where a < 100;
 Insert on t2  (cost=0.00..29742.00 rows=100 width=105) (actual
time=278.300..278.300 rows=0 loops=1)
   ->  Seq Scan on t  (cost=0.00..29742.00 rows=100 width=105) (actual
time=0.061..277.330 rows=99 loops=1)
 Filter: (a < 100)
 Rows Removed by Filter: 01
 Planning Time: 0.093 ms
 Execution Time: 278.330 ms
(6 rows)

With parallel
 Insert on t2  (cost=1000.00..23460.33 rows=100 width=105) (actual
time=108.410..108.410 rows=0 loops=1)
   ->  Gather  (cost=1000.00..23460.33 rows=100 width=105) (actual
time=0.306..108.973 rows=99 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Parallel Seq Scan on t  (cost=0.00..22450.33 rows=42
width=105) (actual time=66.396..101.979 rows=33 loops=3)
   Filter: (a < 100)
   Rows Removed by Filter: 00
 Planning Time: 0.154 ms
 Execution Time: 110.158 ms
(9 rows)

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


POC_parallel_insert_into.patch
Description: Binary data


Re: Include access method in listTables output

2020-07-11 Thread vignesh C
On Mon, Jul 6, 2020 at 1:24 PM Georgios  wrote:
>
>
>
> ‐‐‐ Original Message ‐‐‐
> On Monday, July 6, 2020 3:12 AM, Michael Paquier  wrote:
>
> > On Sun, Jul 05, 2020 at 07:13:10AM +0530, vignesh C wrote:
> >
> > > I'm not sure if we should include showViews, I had seen that the
> > > access method was not getting selected for view.
> >
> > +1. These have no physical storage, so you are looking here for
> > relkinds that satisfy RELKIND_HAS_STORAGE().
>
> Thank you for the review.
> Find attached v4 of the patch.

Thanks for fixing the comments.
The patch applies cleanly, and make check & make check-world pass.
I was not sure whether any documentation change is required for this
in doc/src/sgml/ref/psql-ref.sgml. Thoughts?
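For reference, since the access method only applies to relkinds with storage
(per the RELKIND_HAS_STORAGE() point above), the listTables part boils down
to something like this (a sketch, not the patch's actual query):

SELECT c.relname, c.relkind, am.amname AS access_method
  FROM pg_catalog.pg_class c
  LEFT JOIN pg_catalog.pg_am am ON am.oid = c.relam
 WHERE c.relkind IN ('r', 'm', 't');  -- plain tables, matviews, TOAST tables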


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Remove Extra palloc Of raw_buf For Binary Format In COPY FROM

2020-07-11 Thread vignesh C
On Tue, Jun 30, 2020 at 2:41 PM Bharath Rupireddy
 wrote:
>
> Thanks Vignesh and Rushabh for reviewing this.
>
> I've added this patch to commitfest - https://commitfest.postgresql.org/28/.

I felt this patch is ready for committer, changing the status to ready
for committer.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-07-11 Thread Bharath Rupireddy
Thanks for the comments. Attaching the v2 patch.

>
> > One way, we could solve the above problem is that, upon firing the new
> > foreign query from local backend using the cached connection,
> > (assuming the remote backend that was cached in the local backed got
> > killed for some reasons), instead of failing the query in the local
> > backend, upon detecting error from the remote backend, we could just
> > delete the cached old entry and try getting another connection to
> > remote backend, cache it and proceed to submit the query. This has to
> > happen only at the beginning of remote xact.
> +1.
>
> I think this is a very useful feature.
> In an environment with connection pooling for local, if a remote
> server has a failover or switchover,
> this feature would prevent unexpected errors of local queries after
> recovery of the remote server.

Thanks for backing this feature.

>
> I haven't looked at the code in detail yet, some comments here.
>

Thanks for the comments. Please feel free to review more of the
attached v2 patch.

>
> 1. To keep the previous behavior (and performance), how about allowing
> the user to specify
>whether or not to retry as a GUC parameter or in the FOREIGN SERVER
OPTION?
>

Do we actually need this? We don't incur much performance overhead with this
connection retry, as we only do it at the beginning of the remote xact, i.e.
only once per remote session. If we are able to establish the connection,
well and good; otherwise, the query is bound to fail anyway.

If we do need one (i.e. if there is a strong reason to have the option),
then the question is: GUC or SERVER OPTION?
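Just to make the two shapes concrete (purely hypothetical names - neither
exists in the patch today):

-- as a GUC:
SET postgres_fdw.connection_retry = on;
-- as a per-server option:
ALTER SERVER foreign_server OPTIONS (ADD connection_retry 'on');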

There's a similar discussion going on having GUC at the core level vs
SERVER OPTION for
postgres_fdw in [1].

>
> 2. How about logging a LOG message when retry was success to let us know
>the retry feature worked or how often the retries worked ?
>

In the v1 patch I added the logging messages; in the v2 patch the message
"postgres_fdw connection retry is successful" is added. Please note that all
the new logs are added at level "DEBUG3", since all the existing logs are at
the same level.

>
> > I couldn't think of adding a test case to the existing postgres_fdw
> > regression test suite with an automated scenario of the remote backend
> > getting killed.
>
> Couldn't you confirm this by adding a test case like the following?
> ===
> BEGIN;
> -- Generate a connection to remote
> SELECT * FROM ft1 LIMIT 1;
>
> -- retrieve pid of postgres_fdw and kill it
> -- could use the other unique identifier (not postgres_fdw but
> fdw_retry_check, etc ) for application name
> SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE
> backend_type = 'client backend' AND application_name = 'postgres_fdw'
>
> -- COMMIT, so next query will should success if connection-retry works
> COMMIT;
> SELECT * FROM ft1 LIMIT 1;
> ===
>

Yes, this way it works. Thanks for the suggestion. I added the test
case to the postgres_fdw regression test suite. v2 patch has these
changes also.

[1] -
https://www.postgresql.org/message-id/CALj2ACVvrp5%3DAVp2PupEm%2BnAC8S4buqR3fJMmaCoc7ftT0aD2A%40mail.gmail.com
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From b3725acc916e0daa2797b9d159fc6cc5cdcf43c6 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 11 Jul 2020 14:57:58 +0530
Subject: [PATCH v2] Retry Cached Remote Connections For postgres_fdw

Remote connections are cached for postgres_fdw in the local backend.
There is a high chance that the remote backend can no longer be
available, i.e. it can get killed; subsequent foreign queries
from the local backend/session that use the cached connection then fail
because the remote backend has become unavailable/killed. This patch
solves that problem:
1. the local backend/session uses the cached connection, but receives an error
2. upon receiving the first error, it deletes the cached entry
3. it tries to get a new connection at the start of the remote xact
---
 contrib/postgres_fdw/connection.c | 130 +-
 .../postgres_fdw/expected/postgres_fdw.out|  30 
 contrib/postgres_fdw/sql/postgres_fdw.sql |  19 +++
 3 files changed, 175 insertions(+), 4 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 52d1fe3563..e3658dd704 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -45,6 +45,19 @@
  */
 typedef Oid ConnCacheKey;
 
+typedef enum ConnectionRetryType
+{
+	/* initial value for all cache entries */
+	CONN_RETRY_NONE,
+	/* indicates that the caller is ready to retry connection */
+	CONN_RETRY_READY,
+	/*
+	 * indicates to the caller that the connection may have
+	 * broken, it's okay to retry to get a new connection.
+	*/
+	CONN_RETRY_DO
+}ConnectionRetryType;
+
 typedef struct ConnCacheEntry
 {
 	ConnCacheKey key;			/* hash key (must be first) */
@@ -58,8 +71,15 @@ typed

Re: Binary support for pgoutput plugin

2020-07-11 Thread Petr Jelinek

Hi,

On 11/07/2020 14:14, Dave Cramer wrote:



On Fri, 10 Jul 2020 at 14:21, Tom Lane > wrote:


 > Reading through the new patch, and running the tests, I'm marking
this as Ready
 > for Committer.  It does need some cosmetic TLC, quite possibly
just from
 > pg_indent but I didn't validate if it will take care of
everything, and comment
 > touchups (there is still a TODO comment around wording that needs
to be
 > resolved).  However, I think it's in good enough shape for
consideration at
 > this point.

I took a quick look through the patch and had some concerns:

* Please strip out the PG_VERSION_NUM and USE_INTEGER_DATETIMES checks.
Those are quite dead so far as a patch for HEAD is concerned --- in
fact,
since USE_INTEGER_DATETIMES hasn't even been defined since v10 or so,
the patch is actively doing the wrong thing there.  Not that it matters.
This code will never appear in any branch where float timestamps could
be a thing.

* I doubt that the checks on USE_FLOAT4/8_BYVAL, sizeof(int),
endiannness,
etc, make any sense either.  Those surely do not affect the on-the-wire
representation of values --- or if they do, we've blown it somewhere
else.
I'd just take out all those checks and assume that the binary
representation is sufficiently portable.  (If it's not, it's more or
less
the user's problem, just as in binary COPY.)


So is there any point in having them as options then ?



I am guessing this is copied from pglogical, right? We have them there 
because it can optionally send data in the on-disk format (not the 
network binary format) and there this matters, but for network binary 
format they do not matter as Tom says.


--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Binary support for pgoutput plugin

2020-07-11 Thread Tom Lane
Petr Jelinek  writes:
> On 11/07/2020 14:14, Dave Cramer wrote:
>> So is there any point in having them as options then ?

> I am guessing this is copied from pglogical, right? We have them there 
> because it can optionally send data in the on-disk format (not the 
> network binary format) and there this matters, but for network binary 
> format they do not matter as Tom says.

Ah, I wondered why that was there at all.  Yes, you should just delete
all that logic --- it's irrelevant as long as we use the send/recv
functions.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I don't see hash_mem as being any kind of proper fix- it's just punting
> > to the user saying "we can't figure this out, how about you do it" and,
> > worse, it's in conflict with how we already ask the user that question.
> > Turning it into a multiplier doesn't change that either.
> 
> Have you got a better proposal that is reasonably implementable for v13?
> (I do not accept the argument that "do nothing" is a better proposal.)
> 
> I agree that hash_mem is a stopgap, whether it's a multiplier or no,
> but at this point it seems difficult to avoid inventing a stopgap.
> Getting rid of the process-global work_mem setting is a research project,
> and one I wouldn't even count on having results from for v14.  In the
> meantime, it seems dead certain that there are applications for which
> the current behavior will be problematic.  hash_mem seems like a cleaner
> and more useful stopgap than the "escape hatch" approach, at least to me.

Have we heard from people running actual applications where there is a
problem with raising work_mem to simply match what's already happening
with the v12 behavior?

Sure, there have been some examples on this thread of people who know the
backend well showing how the default work_mem will cause the v13 HashAgg
to spill to disk when given a query which has poor estimates, and that's
slower than v12 where it ignored work_mem and used a bunch of memory,
but it was also shown that raising work_mem addresses that issue and
brings v12 and v13 back in line.
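For what it's worth, that mitigation is just the ordinary knob we already
have - a sketch (the role name is made up):

SET work_mem = '256MB';                              -- per session
ALTER ROLE reporting_user SET work_mem = '256MB';    -- per role (hypothetical)
-- in v13, EXPLAIN ANALYZE on the HashAggregate node shows whether it
-- spilled (look for "Disk Usage" in the node's output)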

There was a concern raised that other nodes might then use more memory-
but there's nothing new there, if you wanted to avoid batching with a
HashJoin in v12 you'd have exactly the same issue, and yet folks raise
work_mem all the time to address this, and to get that HashAgg plan in
the first place too when the estimates aren't so far off.

There now seems to be some suggestions that not only should we have a
new GUC, but we should default to having it not be equal to work_mem (or
1.0 or whatever) and instead be higher, to be *twice* (or more) whatever
the existing work_mem setting is- meaning that people whose systems are
working just fine and have good estimates that represent their workload
and who get the plans they want may then start seeing differences and
increased memory utilization in places that they *don't* want that, all
because we're scared that someone, somewhere, might see a regression due
to HashAgg spilling to disk.

So, no, I don't agree that 'do nothing' (except ripping out the one GUC
that was already added) is a worse proposal than adding another work_mem
like thing that's only for some node types.  There's no way that we'd
even be considering such an approach during the regular development
cycle either- there would be calls for a proper holistic view, at least
to the point where every node type that could possibly allocate a
reasonable chunk of memory would be covered.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tom Lane
David Rowley  writes:
> hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented
> it differently.  The problem is made worse by the fact that we'll only
> release the memory for the hash table during ExecEndHashJoin(). If the
> planner had some ability to provide the executor with knowledge that
> the node would never be rescanned, then the executor could release the
> memory for the hash table after the join is complete.

EXEC_FLAG_REWIND seems to fit the bill already?

regards, tom lane




Re: min_safe_lsn column in pg_replication_slots view

2020-07-11 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Jul-09, Alvaro Herrera wrote:
>> I think we should define InvalidXLogSegNo to be ~((uint64)0) and add a
>> macro to test for that.

> That's overkill really.  I just used zero.  Running
> contrib/test_decoding under valgrind, this now passes.

> I think I'd rather do away with the compare to zero, and initialize to
> something else in GetWALAvailability, though.  What we're doing seems
> unclean and unclear.

Is zero really not a valid segment number?

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-07-11 Thread Sascha Kuhl
Sorry, my topic is different

Sascha Kuhl  schrieb am Sa., 11. Juli 2020, 15:32:

>
>
>
> Tomas Vondra  schrieb am Sa., 11. Juli
> 2020, 13:24:
>
>> On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>> >Tomas Vondra  schrieb am Fr., 10. Juli
>> 2020,
>> >14:09:
>> >
>> >> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
>> tomas.von...@2ndquadrant.com>
>> >> wrote:
>> >> >>
>> >> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> >> > wrote:
>> >> >> ...
>> >> >> >> >
>> >> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >> >inclined to rush on these three as well.  But you're willing to
>> >> commit
>> >> >> >> >them, you can count round of review on me.
>> >> >> >> >
>> >> >> >>
>> >> >> >> I have no intention to get 0001-0003 committed. I think those
>> changes
>> >> >> >> are beneficial on their own, but the primary reason was to
>> support
>> >> the
>> >> >> >> new opclasses (which require those changes). And those parts are
>> not
>> >> >> >> going to make it into v13 ...
>> >> >> >
>> >> >> >OK, no problem.
>> >> >> >Let's do this for v14.
>> >> >> >
>> >> >>
>> >> >> Hi Alexander,
>> >> >>
>> >> >> Are you still interested in reviewing those patches? I'll take a
>> look at
>> >> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> >> have any comments about 0004 / 0005, which I think are the more
>> >> >> interesting parts of this series?
>> >> >>
>> >> >>
>> >> >> Attached is a rebased version - I realized I forgot to include 0005
>> in
>> >> >> the last update, for some reason.
>> >> >>
>> >> >
>> >> >I've done a quick test with this patch set. I wonder if we can improve
>> >> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >> >When we pass brin-bloom or brin-multi-range to that function the
>> >> >binary values are shown in 'value' column but it seems not helpful for
>> >> >users. For instance, here is an output of brin_page_items() with a
>> >> >brin-multi-range index:
>> >> >
>> >> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >> >2), 'mul');
>> >> >-[ RECORD 1
>> >>
>> ]--
>> >>
>> >>
>> >---
>> >> >
>> >> >itemoffset  | 1
>> >> >blknum  | 0
>> >> >attnum  | 1
>> >> >allnulls| f
>> >> >hasnulls| f
>> >> >placeholder | f
>> >> >value   |
>> >>
>> {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef
>> >>
>> >>
>> >70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700
>> >> >071}
>> >> >
>> >>
>> >> Hmm. I'm not sure we can do much better, without making the function
>> >> much more complicated. I mean, even with regular BRIN indexes we don't
>> >> really know if the value is plain min/max, right?
>> >>
>> >You can be sure with the next node. The value is in can be false positiv.
>> >The value is out is clear. You can detect the change between in and out.
>> >
>>
>> I'm sorry, I don't understand what you're suggesting. How is any of this
>> related to false positive rate, etc?
>>
>
> Hi,
>
> You check by the bloom filter if a value you're searching is part of the
> node, right?
>
> In case, the value is in the bloom filter you could be mistaken, because
> another value could have the same hash profile, no?
>
> However if the value is out, the filter can not react. You can be sure
> that the value is out.
>
> If you looking for a range or many ranges of values, you traverse many
> nodes. By knowing the value is out, you can state a clear set of nodes that
> form the range. However the border is somehow unsharp because of the false
> positives.
>
> I am not sure if we write about the same. Please confirm, this can be
> needed. Please.
>
> I will try to understand what you write. Interesting
>
> Sascha
>
>>
>> The problem here is that while plain BRIN opclasses have fairly simple
>> summary that can be stored using a fixed number of simple data types
>> (e.g. minmax will store two values with the same data types as the
>> indexd column)
>>
>>  result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
>>   sizeof(MinmaxOpaque));
>>  result->oi_nstored = 2;
>>  result->oi_opaque = (MinmaxOpaque *)
>>  MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
>>  result->oi_typcache[0] = result->oi_typcache[1] =
>>  lookup_type_cache(typoid, 0);
>>
>> The opclassed introduced here have somewhat mo

Re: WIP: BRIN multi-range indexes

2020-07-11 Thread Sascha Kuhl
Tomas Vondra  schrieb am Sa., 11. Juli 2020,
13:24:

> On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
> >Tomas Vondra  schrieb am Fr., 10. Juli
> 2020,
> >14:09:
> >
> >> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
> >> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> >> wrote:
> >> >>
> >> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
> >> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
> >> >> > wrote:
> >> >> ...
> >> >> >> >
> >> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >> >> >> >inclined to rush on these three as well.  But you're willing to
> >> commit
> >> >> >> >them, you can count round of review on me.
> >> >> >> >
> >> >> >>
> >> >> >> I have no intention to get 0001-0003 committed. I think those
> changes
> >> >> >> are beneficial on their own, but the primary reason was to support
> >> the
> >> >> >> new opclasses (which require those changes). And those parts are
> not
> >> >> >> going to make it into v13 ...
> >> >> >
> >> >> >OK, no problem.
> >> >> >Let's do this for v14.
> >> >> >
> >> >>
> >> >> Hi Alexander,
> >> >>
> >> >> Are you still interested in reviewing those patches? I'll take a
> look at
> >> >> 0001-0003 to check that your previous feedback was addressed. Do you
> >> >> have any comments about 0004 / 0005, which I think are the more
> >> >> interesting parts of this series?
> >> >>
> >> >>
> >> >> Attached is a rebased version - I realized I forgot to include 0005
> in
> >> >> the last update, for some reason.
> >> >>
> >> >
> >> >I've done a quick test with this patch set. I wonder if we can improve
> >> >brin_page_items() SQL function in pageinspect as well. Currently,
> >> >brin_page_items() is hard-coded to support only normal brin indexes.
> >> >When we pass brin-bloom or brin-multi-range to that function the
> >> >binary values are shown in 'value' column but it seems not helpful for
> >> >users. For instance, here is an output of brin_page_items() with a
> >> >brin-multi-range index:
> >> >
> >> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
> >> >2), 'mul');
> >> >-[ RECORD 1
> >>
> ]--
> >>
> >>
> >---
> >> >
> >> >itemoffset  | 1
> >> >blknum  | 0
> >> >attnum  | 1
> >> >allnulls| f
> >> >hasnulls| f
> >> >placeholder | f
> >> >value   |
> >>
> {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef
> >>
> >>
> >70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700
> >> >071}
> >> >
> >>
> >> Hmm. I'm not sure we can do much better, without making the function
> >> much more complicated. I mean, even with regular BRIN indexes we don't
> >> really know if the value is plain min/max, right?
> >>
> >You can be sure with the next node. The value is in can be false positiv.
> >The value is out is clear. You can detect the change between in and out.
> >
>
> I'm sorry, I don't understand what you're suggesting. How is any of this
> related to false positive rate, etc?
>

Hi,

You check with the bloom filter whether a value you're searching for is part
of the node, right?

In case the value is in the bloom filter you could be mistaken, because
another value could have the same hash profile, no?

However, if the value is out, the filter cannot be mistaken: you can be sure
that the value is out.

If you are looking for a range or many ranges of values, you traverse many
nodes. By knowing a value is out, you can state a clear set of nodes that
form the range. However, the border is somewhat unsharp because of the false
positives.

I am not sure if we are writing about the same thing. Please confirm whether
this is what is needed.

I will try to understand what you write. Interesting

Sascha

>
> The problem here is that while plain BRIN opclasses have fairly simple
> summary that can be stored using a fixed number of simple data types
> (e.g. minmax will store two values with the same data types as the
> indexd column)
>
>  result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
>   sizeof(MinmaxOpaque));
>  result->oi_nstored = 2;
>  result->oi_opaque = (MinmaxOpaque *)
>  MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
>  result->oi_typcache[0] = result->oi_typcache[1] =
>  lookup_type_cache(typoid, 0);
>
> The opclassed introduced here have somewhat more complex summary, stored
> as a single bytea value - which is what gets printed by brin_page_items.
>
> To print something easier to read (for humans) we'd either have to teach
> brin_page_items about the diffrent opclasses (multi-range, 

Re: Default setting for enable_hashagg_disk

2020-07-11 Thread David G. Johnston
On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:

> There now seems to be some suggestions that not only should we have a
> new GUC, but we should default to having it not be equal to work_mem (or
> 1.0 or whatever) and instead by higher, to be *twice* or larger whatever
> the existing work_mem setting is- meaning that people whose systems are
> working just fine and have good estimates that represent their workload
> and who get the plans they want may then start seeing differences and
> increased memory utilization in places that they *don't* want that, all
> because we're scared that someone, somewhere, might see a regression due
> to HashAgg spilling to disk.
>

If that increased memory footprint allows the planner to give me a better
plan with faster execution and with no OOM I'd be very happy that this
change happened. While having a more flexible memory allocation framework
is not a primary goal in and of itself it is a nice side-effect.  I'm not
going to say "let's only set work_mem to 32MB instead of 48MB so I can
avoid this faster HashAgg node and instead execute a nested loop (or
whatever)".  More probable is the user whose current nested loop plan is
fast enough and doesn't even realize that with a bit more memory they could
get a HashAgg that performs 15% faster.  For them this is a win on its
face.

I don't believe this negatively impacts the super-admin in our user-base
and is a decent win for the average and below average admin.

Do we really have an issue with plans being chosen while having access to
more memory being slower than plans chosen while having less memory?

The main risk here is that we choose for a user to consume more memory than
they expected and they report OOM issues to us.  We tell them to set this
new GUC to 1.0.  But that implies they are getting many non-HashAgg plans
produced when with a bit more memory those HashAgg plans would have been
chosen.  If they get those faster plans without OOM it's a win, if it OOMs
it's a loss.  I'm feeling optimistic here and we'll get considerably more
wins than losses.  How loss-averse do we need to be here though?  Note we
can give the upgrading user advance notice of our loss-aversion level and
they can simply disagree and set it to 1.0 and/or perform more thorough
testing.  So being optimistic feels like the right choice.
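And if the multiplier form is what lands, that advice is a one-liner anyway
(assuming the GUC ends up with a name like hash_mem_multiplier, which is only
a guess at this point):

SET hash_mem_multiplier = 1.0;  -- hash-based nodes get exactly work_mem again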

David J.


Re: Index Skip Scan (new UniqueKeys)

2020-07-11 Thread Dmitry Dolgov
> On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote:
>
> On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > * Btree-implementation contains btree specific code to implement amskip,
> >   introduced in the previous patch.
>
> The way that you're dealing with B-Tree tuples here needs to account
> for posting list tuples:
>
> > +   currItem = &so->currPos.items[so->currPos.lastItem];
> > +   itup = (IndexTuple) (so->currTuples + 
> > currItem->tupleOffset);
> > +   nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);

Do you mean this last part with t_tid, which could also have a tid array
in case of posting tuple format?

> > +   /*
> > +* To check if we returned the same tuple, try to find a
> > +* startItup on the current page. For that we need to update
> > +* scankey to match the whole tuple and set nextkey to 
> > return
> > +* an exact tuple, not the next one. If the nextOffset is 
> > the
> > +* same as before, it means we are in the loop, return 
> > offnum
> > +* to the original position and jump further
> > +*/
>
> Why does it make sense to use the offset number like this? It isn't
> stable or reliable. The patch goes on to do this:
>
> > +   startOffset = _bt_binsrch(scan->indexRelation,
> > + so->skipScanKey,
> > + so->currPos.buf);
> > +
> > +   page = BufferGetPage(so->currPos.buf);
> > +   maxoff = PageGetMaxOffsetNumber(page);
> > +
> > +   if (nextOffset <= startOffset)
> > +   {
>
> Why compare a heap TID's offset number (an offset number for a heap
> page) to another offset number for a B-Tree leaf page? They're
> fundamentally different things.

Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test if they're
the same (in which case if I'm not mistaken probably we can compare item
pointers). I got this question while I was about to post a new version
with changes to address feedback from Andy; now I'll combine them and
send a cumulative patch.




Re: Index Skip Scan (new UniqueKeys)

2020-07-11 Thread Dmitry Dolgov
> On Fri, Jul 10, 2020 at 05:03:37PM +, Floris Van Nee wrote:
>
> Also took another look at the patch now, and found a case of incorrect
> data. It looks related to the new way of creating the paths, as I
> can't recall seeing this in earlier versions.
>
> create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 
> 10) a, generate_series(1,100) b;
> create index on t1 (a,b,c);
>
> postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
>   QUERY PLAN
> ---
>  Sort  (cost=2.92..2.94 rows=10 width=20)
>Sort Key: a, b DESC, c
>->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
>  Skip scan: true
> (4 rows)

Good point, thanks for looking at this. With the latest planner version
there are indeed more possibilities to use skipping. It never occurred to
me that some of those paths would still rely on the index scan returning the
full data set. I'll look into the details and add verification to prevent
putting something like this on top of a skip scan in the next version.




Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"

2020-07-11 Thread Mark Dilger



> On Jul 10, 2020, at 9:44 PM, Michael Paquier  wrote:
> 
> On Wed, Jul 08, 2020 at 10:00:47PM +0900, Michael Paquier wrote:
>> There are two patches on this thread v2-0001 being much smaller than
>> v2-0002.  I have looked at 0001 for now, and, like Alvaro, this
>> renaming makes sense to me.  Those commands work on objects that are
>> relkinds, except for one OBJECT_TYPE.  So, let's get 0001 patch
>> merged.  Any objections from others?
> 
> I have been through this one again and applied it as cc35d89.
> --
> Michael

Thanks for committing!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Peter Geoghegan
On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Have you got a better proposal that is reasonably implementable for v13?
> > (I do not accept the argument that "do nothing" is a better proposal.)

> So, no, I don't agree that 'do nothing' (except ripping out the one GUC
> that was already added) is a worse proposal than adding another work_mem
> like thing that's only for some nodes types.

The question was "Have you got a better proposal that is reasonably
implementable for v13?".

This is anecdotal, but just today somebody on Twitter reported
*increasing* work_mem to stop getting OOMs from group aggregate +
sort:

https://twitter.com/theDressler/status/1281942941133615104

It was possible to fix the problem in this instance, since evidently
there wasn't anything else that really did try to consume ~5 GB of
work_mem memory. Evidently the memory isn't available in any general
sense, so there are no OOMs now. Nevertheless, we can expect OOMs on
this server just as soon as there is a real need to do a ~5GB sort,
regardless of anything else.

I don't think that this kind of perverse effect is uncommon. Hash
aggregate can naturally be far faster than group agg + sort, Hash agg
can naturally use a lot less memory in many cases, and we have every
reason to think that grouping estimates are regularly totally wrong.
You're significantly underestimating the risk.

--
Peter Geoghegan




Re: "tuple concurrently updated" in pg_restore --jobs

2020-07-11 Thread Tom Lane
Justin Pryzby  writes:
> On Fri, Jul 10, 2020 at 05:36:28PM -0400, Tom Lane wrote:
>> I'm not sure how far to back-patch it -- I think the parallel restore
>> of ACLs behavior is not very old, but we might want to teach older
>> pg_dump versions to insert the extra dependency anyway, for safety.

> Yes, and the test case in David's patch on other thread [0] can't be
> backpatched further than this patch is.

Actually, the answer seems to be that we'd better back-patch all the way,
because this is a live bug much further back than I'd guessed. pg_restore
is willing to run these ACL restores in parallel in all active branches.
The given test case only shows a failure back to 9.6, because older
versions don't dump ACLs on system catalogs; but of course you can just
try it with a user table instead.

Oddly, I could not get the "tuple concurrently updated" syndrome to
appear on 9.5.  Not sure why not; the GRANT/REVOKE code looks the
same as in 9.6.  What I *could* demonstrate in 9.5 is that sometimes
the post-restore state is flat out wrong: the column-level grants go
missing, presumably as a result of the table-level REVOKE executing
after the column-level GRANTs.  Probably that syndrome occurs sometimes
in later branches too, depending on timing; but I didn't look.
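For anyone trying to reproduce it, the statements involved are of this
general shape (a simplified sketch with made-up names; pg_dump emits the
table-level REVOKE before the column-level GRANTs, and parallel restore has
to preserve that ordering):

REVOKE ALL ON TABLE some_table FROM PUBLIC;
GRANT SELECT (some_column) ON TABLE some_table TO some_role;

If two workers run these concurrently, the REVOKE can execute after the
GRANT and wipe out the column privileges, or both can hit the same catalog
row at once and fail with "tuple concurrently updated".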

regards, tom lane




Re: jsonpath versus NaN

2020-07-11 Thread Tom Lane
Alexander Korotkov  writes:
> On Thu, Jul 9, 2020 at 4:04 AM Alexander Korotkov  
> wrote:
>> I understand both patches as fixes and propose to backpatch them to 12
>> if no objections.

> Both patches are pushed.

Thanks for taking care of that!

regards, tom lane




Re: [PATCH] Remove Extra palloc Of raw_buf For Binary Format In COPY FROM

2020-07-11 Thread Tom Lane
vignesh C  writes:
> On Tue, Jun 30, 2020 at 2:41 PM Bharath Rupireddy
>  wrote:
>> I've added this patch to commitfest - https://commitfest.postgresql.org/28/.

> I felt this patch is ready for committer, changing the status to ready
> for committer.

Pushed with some fiddling.  Mainly, if we're going to the trouble of
checking for binary mode here, we might as well skip allocating the
line_buf too.

regards, tom lane




Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"

2020-07-11 Thread Mark Dilger



> On Jul 10, 2020, at 11:00 PM, Michael Paquier  wrote:
> 
> On Wed, Jul 01, 2020 at 05:04:19PM -0700, Mark Dilger wrote:
>> Most of the work in this patch is mechanical replacement of if/else
>> if/else statements which hinge on relkind to switch statements on
>> relkind.  The patch is not philosophically very interesting, but it
>> is fairly long.  Reviewers might start by scrolling down the patch
>> to the changes in src/include/catalog/pg_class.h 
>> 
>> There are no intentional behavioral changes in this patch.
> 
> Please note that 0002 does not apply anymore, there are conflicts in
> heap.c and tablecmds.c, and that there are noise diffs, likely because
> you ran pgindent and included places unrelated to this patch.

I can resubmit, but should like to address your second point before bothering...

> Anyway,
> I am not really a fan of this patch.  I could see a benefit in
> switching to an enum so as for places where we use a switch/case
> without a default we would be warned if a new relkind gets added or if
> a value is not covered, but then we should not really need
> RELKIND_NULL, no?

There are code paths where relkind is sometimes '\0' under normal, 
non-exceptional conditions.  This happens in

allpaths.c: set_append_rel_size
rewriteHandler.c: view_query_is_auto_updatable
lockcmds.c: LockViewRecurse_walker
pg_depend.c: getOwnedSequences_internal

Doesn't this justify having RELKIND_NULL in the enum?

It is not the purpose of this patch to change the behavior of the code.  This 
is just a structural patch, using an enum and switches rather than char and 
if/else if/else blocks.

Subsequent patches could build on this work, such as changing the behavior when 
code encounters a relkind value outside the code's expected set of relkind 
values.  Whether those patches would add Assert()s, elog()s, or ereport()s is 
not something I'd like to have to debate as part of this patch submission.  
Assert()s have the advantage of costing nothing in production builds, but 
elog()s have the advantage of protecting against corrupt relkind values at 
runtime in production.

Getting the compiler to warn when a new relkind is added to the enumeration but 
not handled in a switch is difficult.  One strategy is to add -Wswitch-enum, 
but that would require refactoring switches over all enums, not just over the 
RelKind enum, and for some enums, that would require a large number of extra 
lines to be added to the code.  Another strategy is to remove the default label 
from switches over RelKind, but that removes protections against invalid 
relkinds being encountered.

Do you have a preference about which directions I should pursue?  Or do you 
think the patch idea itself is dead?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Reducing WaitEventSet syscall churn

2020-07-11 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 13 Mar 2020, at 08:21, Kyotaro Horiguchi  wrote:
>> The attached are:
>> 0001-0004 Not changed
>> 0005 Fix interface of PQregisterEventProc
>> 0006 Add new libpq event for this use.
>> 0007 Another version of "0006 Reuse a WaitEventSet in
>> libpqwalreceiver.c" based on libpq event.
>> 0008-0011 Not changed (old 0007-0010, blindly appended)

> Since 0001 has been applied already in 9b8aa0929390a, the patchtester is 
> unable
> to make heads or tails with trying this patchset.  Can you please submit a
> rebased version without the already applied changes?

While I've not really looked at this patchset, I did happen to notice
0005, and I think that's utterly unacceptable as written.  You can't
break API/ABI on a released libpq function.  I suppose the patch is
assuming that an enum return value is ABI-equivalent to int, but that
assumption is faulty.  Moreover, what's the point?  None of the later
patches require this AFAICS.  (The patch could probably be salvaged
ABI-wise by making the error codes be integer #define's not an enum,
but I fail to see the point of changing this at all.  I also don't
much like the idea of allowing callers to assume that there is a fixed
set of possible failure conditions for PQregisterEventProc.  If we
wanted to return error details, it'd likely be better to say that an
error message is left in conn->errorMessage.)

0006 is an even more egregious ABI break; you can't renumber existing
enum values without breaking applications.  That in itself could be
fixed by putting the new value at the end.  But I'd still object very
strongly to 0006, because I do not think that it's acceptable to
have pqDropConnection correspond to an application-visible event.
We use that all over the place for cases that should not be
application-visible, for example when deciding that a connection attempt
has failed and moving on to the next candidate host.  We already have
PGEVT_CONNRESET and PGEVT_CONNDESTROY as application-visible connection
state change events, and I don't see why those aren't sufficient.

(BTW, even if these weren't ABI breaks, where are the documentation
changes to go with them?)

regards, tom lane




Re: Does TupleQueueReaderNext() really need to copy its result?

2020-07-11 Thread Soumyadeep Chakraborty
Hey Thomas,

On Fri, Jul 10, 2020 at 7:30 PM Thomas Munro  wrote:
>
> > I could reproduce the speed gain that you saw for a plan with a simple
> > parallel sequential scan. However, I got no gain at all for a parallel
> > hash join and parallel agg query.
>
> Right, it's not going to make a difference when you only send one
> tuple through the queue, like COUNT(*) does.

How silly of me! I should have paid more attention to the rows output
from each worker and that there was a select count(*) on the join query.
Anyway, these are a new set of results:

---

select pg_prewarm('lineitem');
select pg_prewarm('orders');
-- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G

explain analyze select *
  from lineitem
  join orders on l_orderkey = o_orderkey
 where o_totalprice > 5.00;

[w/o any patch]   637s
[w/ first patch]  635s
[w/ last minimal tuple patch] 568s

---

We do indeed get the speedup.

> > As for gather merge, is it possible to have a situation where the slot
> > input to tqueueReceiveSlot() is a heap slot (as would be the case for a
> > simple select *)? If yes, in those scenarios, we would be incurring an
> > extra call to minimal_tuple_from_heap_tuple() because of the extra call
> > to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch.
> > And since, in a gather merge, we can't avoid the copy on the leader side
> > (heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be
> > doing extra work in that scenario. I couldn't come up with a plan that
> > creates a scenario like this however.
>
> Hmm.  I wish we had a way to do an "in-place" copy-to-minimal-tuple
> where the caller supplies the memory, with some fast protocol to get
> the size right.  We could use that for copying tuples into shm queues,
> hash join tables etc without an extra palloc()/pfree() and double
> copy.

Do you mean that we should have an implementation for
get_minimal_tuple() for the heap AM and have it return a pointer to the
minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
tqueueReceiveSlot() will ensure that the heap tuple from which it wants
to extract the minimal tuple was allocated in the tuple queue in the
first place? If we consider that the node directly below a gather is a
SeqScan, then we could possibly, in ExecInitSeqScan(), set up the
ss_ScanTupleSlot to point to memory in the shared tuple queue?
Similarly, other ExecInit*() methods could do the same for other executor
nodes that involve parallelism? Of course, things would be slightly
different for the other use cases you mentioned (such as hash table
population).

All things considered, I think the patch in its current form should go
in. Having the in-place copy could be done as a separate patch? Do you
agree?

Regards,
Soumyadeep (VMware)




Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"

2020-07-11 Thread Tom Lane
Mark Dilger  writes:
> On Jul 10, 2020, at 11:00 PM, Michael Paquier  wrote:
>> I am not really a fan of this patch.  I could see a benefit in
>> switching to an enum so as for places where we use a switch/case
>> without a default we would be warned if a new relkind gets added or if
>> a value is not covered, but then we should not really need
>> RELKIND_NULL, no?

> There are code paths where relkind is sometimes '\0' under normal, 
> non-exceptional conditions.  This happens in

>   allpaths.c: set_append_rel_size
>   rewriteHandler.c: view_query_is_auto_updatable
>   lockcmds.c: LockViewRecurse_walker
>   pg_depend.c: getOwnedSequences_internal

> Doesn't this justify having RELKIND_NULL in the enum?

I'd say no.  I think including an intentionally invalid value in such
an enum is horrid, mainly because it will force a lot of places to cover
that value when they shouldn't (or else draw "enum value not handled in
switch" warnings).  The confusion factor about whether it maybe *is*
a valid value is not to be discounted, either.

If we can't readily get rid of the use of '\0' in these code paths,
maybe trying to convert to an enum isn't going to be a win after all.

> Getting the compiler to warn when a new relkind is added to the
> enumeration but not handled in a switch is difficult.

We already have a project policy about how to do that.

regards, tom lane




Re: output columns of \dAo and \dAp

2020-07-11 Thread Tom Lane
Alexander Korotkov  writes:
> The proposed patch is attached.  This patch is fixes two points:
>  * Adds strategy number and purpose to output of \dAo
>  * Renames "Left/right arg type" columns of \dAp to "Registered left/right 
> type"

I think that \dAp should additionally be changed to print the
function via "oid::regprocedure", not just proname.  A possible
compromise, if you think that's too wordy, is to do it that
way for "\dAp+" while printing plain proname for "\dAp".

BTW, isn't this:

  "  format ('%%s (%%s, %%s)',\n"
  "CASE\n"
  "  WHEN pg_catalog.pg_operator_is_visible(op.oid) \n"
  "  THEN op.oprname::pg_catalog.text \n"
  "  ELSE 
o.amopopr::pg_catalog.regoper::pg_catalog.text \n"
  "END,\n"
  "pg_catalog.format_type(o.amoplefttype, NULL),\n"
  "pg_catalog.format_type(o.amoprighttype, NULL)\n"
  "  ) AS \"%s\"\n,"

just an extremely painful way to duplicate the results of regoperator?
(You could likely remove the joins to pg_proc and pg_operator altogether
if you relied on regprocedure and regoperator casts.)

> I'm not yet convinced we should change the sort key for \dAo.

After playing with this more, I'm less worried about that than
I was.  I think I was concerned that the operator name would
sort ahead of amopstrategy, but now I see that the op name isn't
part of the sort key at all.

BTW, these queries seem inadequately schema-qualified, notably
the format() calls.

regards, tom lane




Re: min_safe_lsn column in pg_replication_slots view

2020-07-11 Thread Alvaro Herrera
On 2020-Jul-11, Tom Lane wrote:

> Alvaro Herrera  writes:
> > On 2020-Jul-09, Alvaro Herrera wrote:
> >> I think we should define InvalidXLogSegNo to be ~((uint64)0) and add a
> >> macro to test for that.
> 
> > That's overkill really.  I just used zero.  Running
> > contrib/test_decoding under valgrind, this now passes.
> 
> > I think I'd rather do away with the compare to zero, and initialize to
> > something else in GetWALAvailability, though.  What we're doing seems
> > unclean and unclear.
> 
> Is zero really not a valid segment number?

No, but you cannot retreat from that ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Support for NSS as a libpq TLS backend

2020-07-11 Thread Daniel Gustafsson
> On 10 Jul 2020, at 07:10, Thomas Munro  wrote:
> 
> On Fri, Jul 3, 2020 at 11:51 PM Daniel Gustafsson  wrote:
>>> On 25 Jun 2020, at 17:39, Daniel Gustafsson  wrote:
 On 15 May 2020, at 22:46, Daniel Gustafsson  wrote:
 The 0001 patch contains the full NSS support, and 0002 is a fix for the 
 pgstat
 abstraction which IMO leaks backend implementation details.  This needs to 
 go
 on it's own thread, but since 0001 fails without it I've included it here 
 for
 simplicity sake for now.
>>> 
>>> The attached 0001 and 0002 are the same patchseries as before, but with the
>>> OpenSSL test module fixed and a rebase on top of the current master.
>> 
>> Another rebase to resolve conflicts with the recent fixes in the SSL tests, 
>> as
>> well as some minor cleanup.
> 
> Hi Daniel,
> 
> Thanks for blazing the trail for other implementations to coexist in
> the tree.  I see that cURL (another project Daniel works on)
> supports a lot of TLS implementations[1].

The list at that URL is also just a selection; the total count is 10 (not
counting OpenSSL forks) IIRC, after axing support for a few lately.  OpenSSL
clearly has a large mindshare but the gist of it is that there exist quite a
few alternatives each with their own strengths.

> I recognise 4 other library
> names from that table as having appeared on this mailing list as
> candidates for PostgreSQL support complete with WIP patches, including
> another one from you (Apple Secure Transport).  I don't have strong
> views on how many and which libraries we should support,

I think it's key to keep in mind *why* it's relevant to provide options in the
first place; after all, since they must be 100% interoperable, one can easily
argue that a single one is enough.  We need to look at what they offer users on
top of just a TLS connection: managed certificate storage (for example macOS
Keychains), FIPS certification, good platform availability and/or OS
integration, etc.  If all a library offers is "not being OpenSSL" then it's
not clear that we're adding much value by spending the cycles to support it.

My personal opinion is that we should keep it pretty limited, not least to
lessen the burden of testing and feature development.  Supporting a new
library comes with requirements on both the CFBot and the buildfarm, not
to mention on developers who dabble in that area of the code.  The goal should
IMHO be to make it trivial for every postgres installation to use TLS
regardless of platform and experience level with the person installing it.

The situation is a bit different for curl where we have as a goal to provide
enough alternatives such that every platform can have a libcurl/curl more or
less regardless of what it contains.  As a consequence, we have around 80 CI
jobs to test each pull request to provide ample coverage.  Being a kitchen-
sink is really hard work.

> but I was
> curious how many packages depend on libssl1.1, libgnutls30 and libnss3
> in the Debian package repos in my sources.list, and I came up with
> OpenSSL = 820, GnuTLS = 342, and NSS = 87.

I don't see a lot of excitement over GnuTLS lately, but Debian shipping it due
to (I believe) licensing concerns with OpenSSL does help it along.  In my
experience, platforms with GnuTLS easily available also have OpenSSL easily
available.

> I guess Solution.pm needs at least USE_NSS => undef for this not to
> break the build on Windows.

Thanks, I'll fix (I admittedly haven't tried this at all on Windows yet).

> Obviously cfbot is useless for testing this code, since its build
> script does --with-openssl and you need --with-nss,

Right, this is a CFBot problem with any patch that requires specific autoconf
flags to be exercised.  I wonder if, when we do the CF app integration, we can
make something that can inject flags into a Travis pipeline in a safe manner?

> but it still shows
> us one thing: with your patch, a --with-openssl build is apparently
> broken:
> 
> /001_ssltests.pl .. 1/93 Bailout called.  Further testing stopped:
> system pg_ctl failed

Humm ..  I hate to say "it worked on my machine", but it did, though my TLS
environment is hardly standard.  Sorry for posting breakage; most likely this
is a bug in the new test module structure that the patch introduces in order to
support multiple backends for src/test/ssl.  I'll fix.

> There are some weird configure-related hunks in the patch:
> 
> +  -runstatedir | --runstatedir | --runstatedi | --runstated \
> ...[more stuff like that]...
> 
> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
> +#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
> 
> I see the same when I use Debian's autoconf, but not FreeBSD's or
> MacPorts', despite all being version 2.69.  That seems to be due to
> non-upstreamed changes added by the Debian maintainers (I see the
> off_t thing mentioned in /usr/share/doc/autoconf/changelog.Debian.gz).
> I think you need to build a stoc

Re: proposal: possibility to read dumped table's name from file

2020-07-11 Thread Justin Pryzby
On Mon, Jul 06, 2020 at 06:34:15AM +0200, Pavel Stehule wrote:
> >> > Comment support is still missing but easily added :)
> >> Still missing from the latest patch.
> >
> > I can implement a comment support. But I am not sure about the format. The
> 
> here is support for comment's line - first char should be #

Thanks, that's how I assumed it would look.

> >> With some added documentation, I think this can be RfC.

Do you want to add any more documentation?

Few more things:

> +exit_invalid_filter_format(FILE *fp, char *filename, char *message, char 
> *line, int lineno)
> +{
> + pg_log_error("invalid format of filter file \"%s\": %s",
> +  *filename == '-' ? "stdin" : filename,
> +  message);

You refer to any filename beginning with - as "stdin".

I think you could just print "-" and not "stdin".
In any case, *filename=='-' is wrong since it only checks filename[0].
In a few places you compare ==stdin (which is right).
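
For what it's worth, a minimal sketch of the whole-string comparison (keeping
the patch's variable names; display_name is just illustrative):

    /* compare the whole string, not just the first byte */
    const char *display_name = (strcmp(filename, "-") == 0) ? "-" : filename;

    pg_log_error("invalid format of filter file \"%s\": %s",
                 display_name, message);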

Also, I think "f" isn't as good a name as "fp".

You're adding 139 lines to a 600-line main(), and no other option is more than
15 lines.  Would you put it in a separate function?

-- 
Justin




Re: pg_dump bug for extension owned tables

2020-07-11 Thread Andrew Dunstan


On 6/26/20 2:10 PM, Fabrízio de Royes Mello wrote:
>
> On Fri, Jun 26, 2020 at 11:55 AM Fabrízio de Royes Mello
> mailto:fabriziome...@gmail.com>> wrote:
> >
> >
> > On Fri, Jun 26, 2020 at 11:24 AM Andrew Dunstan
>  > wrote:
> > >
> > >
> > > On 6/26/20 9:57 AM, Andrew Dunstan wrote:
> > > > It appears that for extension owned tables tbinfo.attgenerated isn't
> > > > being properly populated, so line 2050 in REL_12_STABLE, which
> is line
> > > > 2109 in git tip, is failing.
> > > >
> > > >
> > >
> > > Should have mentioned this is in src/bin/pg_dump/pg_dump.c
> > >
> >
> > Having a look on it.
> >
>
> Seems that when we qualify the schemaname, the "tbinfo->interesting" field
> is not set for extension objects, so getTableAttrs can't fill
> the attgenerated field properly.
>
> I'm not 100% sure it's the correct way but the attached patch works
> for me and all tests passed. Maybe we should add more TAP tests?
>
>


I just tried this patch out on master, with the test case I gave
upthread. It's not working, still getting a segfault.


Will look closer.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 09:49:43AM -0700, Peter Geoghegan wrote:

On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Have you got a better proposal that is reasonably implementable for v13?
> (I do not accept the argument that "do nothing" is a better proposal.)



So, no, I don't agree that 'do nothing' (except ripping out the one GUC
that was already added) is a worse proposal than adding another work_mem-like
thing that's only for some node types.


The question was "Have you got a better proposal that is reasonably
implementable for v13?".

This is anecdotal, but just today somebody on Twitter reported
*increasing* work_mem to stop getting OOMs from group aggregate +
sort:

https://twitter.com/theDressler/status/1281942941133615104

It was possible to fix the problem in this instance, since evidently
there wasn't anything else that really did try to consume ~5 GB of
work_mem memory. Evidently the memory isn't available in any general
sense, so there are no OOMs now. Nevertheless, we can expect OOMs on
this server just as soon as there is a real need to do a ~5GB sort,
regardless of anything else.



I find that example rather suspicious. I mean, what exactly in the
GroupAgg plan would consume this memory? Surely it'd have to be some
node below the grouping, but sort shouldn't do that, no?

Seems strange.


I don't think that this kind of perverse effect is uncommon. Hash
aggregate can naturally be far faster than group agg + sort, Hash agg
can naturally use a lot less memory in many cases, and we have every
reason to think that grouping estimates are regularly totally wrong.
You're significantly underestimating the risk.



I agree grouping estimates are often quite off, and I kinda agree with
introducing hash_mem (or at least with the concept that hashing is more
sensitive to the amount of memory than sort is). Not sure it's the right escape
hatch for the hashagg spill problem, but maybe it is.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: BRIN multi-range indexes

2020-07-11 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 03:32:43PM +0200, Sascha Kuhl wrote:

Tomas Vondra  schrieb am Sa., 11. Juli 2020,
13:24:


On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>Tomas Vondra  schrieb am Fr., 10. Juli
2020,
>14:09:
>
>> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
tomas.von...@2ndquadrant.com>
>> wrote:
>> >>
>> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> > wrote:
>> >> ...
>> >> >> >
>> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >inclined to rush on these three as well.  But you're willing to
>> commit
>> >> >> >them, you can count round of review on me.
>> >> >> >
>> >> >>
>> >> >> I have no intention to get 0001-0003 committed. I think those
changes
>> >> >> are beneficial on their own, but the primary reason was to support
>> the
>> >> >> new opclasses (which require those changes). And those parts are
not
>> >> >> going to make it into v13 ...
>> >> >
>> >> >OK, no problem.
>> >> >Let's do this for v14.
>> >> >
>> >>
>> >> Hi Alexander,
>> >>
>> >> Are you still interested in reviewing those patches? I'll take a
look at
>> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> have any comments about 0004 / 0005, which I think are the more
>> >> interesting parts of this series?
>> >>
>> >>
>> >> Attached is a rebased version - I realized I forgot to include 0005
in
>> >> the last update, for some reason.
>> >>
>> >
>> >I've done a quick test with this patch set. I wonder if we can improve
>> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >When we pass brin-bloom or brin-multi-range to that function the
>> >binary values are shown in 'value' column but it seems not helpful for
>> >users. For instance, here is an output of brin_page_items() with a
>> >brin-multi-range index:
>> >
>> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >2), 'mul');
>> >-[ RECORD 1
>>
]--
>>
>>
>---
>> >
>> >itemoffset  | 1
>> >blknum  | 0
>> >attnum  | 1
>> >allnulls| f
>> >hasnulls| f
>> >placeholder | f
>> >value   |
>>
{\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef
>>
>>
>70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700
>> >071}
>> >
>>
>> Hmm. I'm not sure we can do much better, without making the function
>> much more complicated. I mean, even with regular BRIN indexes we don't
>> really know if the value is plain min/max, right?
>>
>You can be sure with the next node. The value is in can be false positiv.
>The value is out is clear. You can detect the change between in and out.
>

I'm sorry, I don't understand what you're suggesting. How is any of this
related to false positive rate, etc?



Hi,

You check with the bloom filter whether a value you're searching for is part
of the node, right?

If the value appears to be in the bloom filter, you could be mistaken, because
another value could have the same hash profile, no?

However, if the value is out, the filter cannot be wrong about that. You can be
sure that the value is out.

If you are looking for a range, or many ranges of values, you traverse many
nodes. By knowing a value is out, you can state a clear set of nodes that form
the range. However, the border is somewhat unsharp because of the false
positives.

I am not sure we are writing about the same thing. Please confirm whether this
is what is needed.



Probably not. Masahiko-san pointed out that pageinspect (which also has
a function to print pages from a BRIN index) does not understand the
summary of the new opclasses and just prints the bytea verbatim.

That has nothing to do with inspecting the bloom filter, or anything
like that. So I think there's some confusion ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: GSSENC'ed connection stalls while reconnection attempts.

2020-07-11 Thread Tom Lane
I wrote:
> Kyotaro Horiguchi  writes:
>> If psql connected using GSSAPI auth and server restarted, reconnection
>> sequence stalls and won't return.

> Yeah, reproduced here.  (I wonder if there's any reasonable way to
> exercise this scenario in src/test/kerberos/.)

I tried writing such a test based on the IO::Pty infrastructure used
by src/bin/psql/t/010_tab_completion.pl, as attached.  It works, but
it feels pretty grotty, especially seeing that so much of the patch
is copy-and-pasted from 010_tab_completion.pl.  I think if we want
to have a test like this, it'd be good to work a little harder on
refactoring so that more of that code can be shared.  My Perl skillz
are a bit weak for that, though.

regards, tom lane

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index b3aeea9574..552d2724e7 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -13,17 +13,30 @@
 
 use strict;
 use warnings;
+
 use TestLib;
 use PostgresNode;
 use Test::More;
+use IPC::Run qw(pump finish timer);
+use Data::Dumper;
 
-if ($ENV{with_gssapi} eq 'yes')
+# Do nothing unless Makefile has told us that the build is --with-gssapi.
+if ($ENV{with_gssapi} ne 'yes')
 {
+	plan skip_all => 'GSSAPI/Kerberos not supported by this build';
+}
+
+# If we don't have IO::Pty, we can't run the test with interactive_psql.
+my $have_pty = 1;
+eval { require IO::Pty; };
+if ($@)
+{
+	$have_pty = 0;
 	plan tests => 18;
 }
 else
 {
-	plan skip_all => 'GSSAPI/Kerberos not supported by this build';
+	plan tests => 22;
 }
 
 my ($krb5_bin_dir, $krb5_sbin_dir);
@@ -275,6 +288,77 @@ test_query(
 	"gssencmode=require",
 	"sending 100K lines works");
 
+# Test that libpq can reconnect after a server restart.
+if ($have_pty)
+{
+	# fire up an interactive psql session
+	my $in  = '';
+	my $out = '';
+
+	my $timer = timer(5);
+
+	my $h = $node->interactive_psql(
+		'postgres',
+		\$in,
+		\$out,
+		$timer,
+		extra_params => [
+			'-XAtd',
+			$node->connstr('postgres')
+			  . " host=$host hostaddr=$hostaddr gssencmode=require user=test1"
+		]);
+
+	like(
+		$out,
+		qr/GSSAPI-encrypted connection/,
+		"startup banner shows GSSAPI encryption");
+
+	# subroutine to send a command and wait for response
+	sub interactive_command
+	{
+		my ($send, $pattern, $annotation) = @_;
+
+		# report test failures from caller location
+		local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+		# reset output collector
+		$out = "";
+		# restart per-command timer
+		$timer->start(5);
+		# send the data to be sent
+		$in .= $send;
+		# wait ...
+		pump $h until ($out =~ $pattern || $timer->is_expired);
+		my $okay = ($out =~ $pattern && !$timer->is_expired);
+		ok($okay, $annotation);
+		# for debugging, log actual output if it didn't match
+		local $Data::Dumper::Terse = 1;
+		local $Data::Dumper::Useqq = 1;
+		diag 'Actual output was '
+		  . Dumper($out)
+		  . "Did not match \"$pattern\"\n"
+		  if !$okay;
+		return;
+	}
+
+	interactive_command("SELECT 2+2;\n", qr/4/, "interactive psql is alive");
+
+	$node->restart;
+
+	interactive_command(
+		"SELECT 2+2;\n",
+		qr/The connection to the server was lost.*GSSAPI-encrypted connection/s,
+		"startup banner shows GSSAPI encryption after reconnection");
+
+	interactive_command("SELECT 20+22;\n", qr/42/, "psql is still alive");
+
+	# send psql an explicit \q to shut it down, else pty won't close properly
+	$timer->start(5);
+	$in .= "\\q\n";
+	finish $h or die "psql returned $?";
+	$timer->reset;
+}
+
 unlink($node->data_dir . '/pg_hba.conf');
 $node->append_conf('pg_hba.conf',
 	qq{hostgssenc all all $hostaddr/32 gss map=mymap});


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Peter Geoghegan
On Fri, Jul 10, 2020 at 10:00 PM David Rowley  wrote:
> hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented
> it differently.  The problem is made worse by the fact that we'll only
> release the memory for the hash table during ExecEndHashJoin(). If the
> planner had some ability to provide the executor with knowledge that
> the node would never be rescanned, then the executor could release the
> memory for the hash table after the join is complete. For now, we'll
> need to live with the fact that an Append containing many children
> doing hash joins will mean holding onto all that memory until the
> executor is shutdown :-(
>
> There's room to make improvements there, for sure, but not for PG13.

I think that we're stuck with the idea that partitionwise join uses up
to one work_mem allocation per partition until we deprecate work_mem
as a concept.

Anyway, I only talked about partitionwise join because that was your
example. I could just as easily have picked on parallel hash join
instead, which is something that I was involved in myself (kind of).
This is more or less a negative consequence of the incremental
approach we have taken here, which is a collective failure.

I have difficulty accepting that something like hash_mem_multiplier
cannot be accepted because it risks making the consequence of
questionable designs even worse. The problem remains that the original
assumption just isn't very robust, and isn't something that the user
has granular control over. In general it makes sense that a design in
a stable branch is assumed to be the norm that new things need to
respect, and not the other way around. But there must be some limit to
how far that's taken.

> It sounds interesting, but it also sounds like a new feature
> post-beta. Perhaps it's better we minimise the scope of the change to
> be a minimal fix just for the behaviour we predict some users might
> not like.

That's an understandable interpretation of the
hash_mem/hash_mem_multiplier proposal on the table, and yet one that I
disagree with. I consider it highly desirable to have a GUC that can
be tuned in a generic and high level way, on general principle. We
don't really do escape hatches, and I'd rather avoid adding one now
(though it's far preferable to doing nothing, which I consider totally
out of the question).

Pursuing what you called hashagg_mem is a compromise that will make
neither of us happy. It seems like an escape hatch by another name. I
would rather just go with your original proposal instead, especially
if that's the only thing that'll resolve the problem in front of us.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Peter Geoghegan
On Sat, Jul 11, 2020 at 4:23 PM Tomas Vondra
 wrote:
> I find that example rather suspicious. I mean, what exactly in the
> GroupAgg plan would consume this memory? Surely it'd have to be some
> node below the grouping, but sort shouldn't do that, no?
>
> Seems strange.

Well, I imagine hash aggregate manages to use much less memory than
the equivalent groupagg's sort, even though to the optimizer it
appears as if hash agg should end up using more memory (which is not
allowed by the optimizer when it exceeds work_mem, regardless of
whether or not it's faster). It may also be relevant that Hash agg can
use less memory simply by being faster. Going faster could easily
reduce the memory usage for the system as a whole, even when you
assume individual group agg nodes use more memory for as long as they
run. So in-memory hash agg is effectively less memory hungry.

It's not a great example of a specific case that we'd regress by not
having hash_mem/hash_mem_multiplier. It's an overestimate where older
releases accidentally got a bad, slow plan, not an underestimate where
older releases "lived beyond their means but got away with it" by
getting a good, fast plan. ISTM that the example is a good example of
the strange dynamics involved.

> I agree grouping estimates are often quite off, and I kinda agree with
> introducing hash_mem (or at least with the concept that hashing is more
> sensitive to amount of memory than sort). Not sure it's the right espace
> hatch to the hashagg spill problem, but maybe it is.

The hash_mem/hash_mem_multiplier proposal aims to fix the problem
directly, and not be an escape hatch, because we don't like escape
hatches. I think that that probably fixes many or most of the problems
in practice, at least assuming that the admin is willing to tune it.
But a small number of remaining installations may still need a "true"
escape hatch. There is an argument for having both, though I hope that
the escape hatch can be avoided.

-- 
Peter Geoghegan




Re: POC: postgres_fdw insert batching

2020-07-11 Thread Tomas Vondra

On Fri, Jul 10, 2020 at 09:28:44AM +0500, Andrey V. Lepikhov wrote:

On 6/28/20 8:10 PM, Tomas Vondra wrote:

Now, the primary reason why the performance degrades like this is that
while FDW has batching for SELECT queries (i.e. we read larger chunks of
data from the cursors), we don't have that for INSERTs (or other DML).
Every time you insert a row, it has to go all the way down into the
partition synchronously.


You added new fields into the PgFdwModifyState struct. Why didn't you
reuse the ResultRelInfo::ri_CopyMultiInsertBuffer field and the
CopyMultiInsertBuffer machinery as storage for incoming tuples?




Because I was focused on speeding up inserts, and that path is not using
CopyMultiInsertBuffer, I think. I agree the way the tuples are stored
may be improved, of course.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





StartupMessage parameters - free-form or not?

2020-07-11 Thread Jaka Jančar
Hi,

I wrote a Postgres client and in it I allow the user to specify arbitrary
StartupMessage parameters (Map). This is convenient because
the user can for example set search_path without issuing a separate SET
query or encoding things into the "options" parameter. The protocol
documentation also says that the latter is deprecated and what I'm doing
(if I understand it right) is preferred.

A fellow author of a driver for a different language reminds me that libpq
explicitly enumerates the supported parameters in the docs, and I checked
the code, and indeed there is a whitelist and others are rejected. So
technically, he's correct: it's nowhere documented that sending e.g.
search_path in StartupMessage parameters will work, and for that matter
whether everything that you can set using SET you can also send there.

What is the proper behavior for a driver here:
 1. Whitelist parameters like libpq does, or
 2. Allow the user to send anything, with the understanding it'll work the
same as SET

Thanks!
Jaka


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 09:02:43AM -0700, David G. Johnston wrote:

On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:


There now seems to be some suggestion that not only should we have a
new GUC, but we should default to having it not be equal to work_mem (or
1.0 or whatever) and instead be higher, to be *twice* or larger than whatever
the existing work_mem setting is, meaning that people whose systems are
working just fine and have good estimates that represent their workload
and who get the plans they want may then start seeing differences and
increased memory utilization in places that they *don't* want that, all
because we're scared that someone, somewhere, might see a regression due
to HashAgg spilling to disk.



If that increased memory footprint allows the planner to give me a better
plan with faster execution and with no OOM I'd be very happy that this
change happened. While having a more flexible memory allocation framework
is not a primary goal in and of itself it is a nice side-effect.  I'm not
going to say "let's only set work_mem to 32MB instead of 48MB so I can
avoid this faster HashAgg node and instead execute a nested loop (or
whatever)".  More probable is the user whose current nested loop plan is
fast enough and doesn't even realize that with a bit more memory they could
get an HashAgg that performs 15% faster.  For them this is a win on its
face.

I don't believe this negatively impacts the super-admin in our user-base
and is a decent win for the average and below average admin.

Do we really have an issue with plans being chosen while having access to
more memory being slower than plans chosen while having less memory?

The main risk here is that we choose for a user to consume more memory than
they expected and they report OOM issues to us.  We tell them to set this
new GUC to 1.0.  But that implies they are getting many non-HashAgg plans
produced when with a bit more memory those HashAgg plans would have been
chosen.  If they get those faster plans without OOM it's a win, if it OOMs
it's a loss.  I'm feeling optimistic here and we'll get considerably more
wins than losses.  How loss-averse do we need to be here though?  Npte we
can give the upgrading user advance notice of our loss-aversion level and
they can simply disagree and set it to 1.0 and/or perform more thorough
testing.  So being optimistic feels like the right choice.



I don't know, but one of the main arguments against simply suggesting
people to bump up work_mem (if they're hit by the hashagg spill in v13)
was that it'd increase overall memory usage for them. It seems strange
to then propose a new GUC set to a default that would result in higher
memory usage *for everyone*.

Of course, having such GUC with a default a multiple of work_mem might
be a win overall - or maybe not. I don't have a very good idea how many
people will get bitten by this, and how many will get speedup (and how
significant the speedup would be).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: StartupMessage parameters - free-form or not?

2020-07-11 Thread Tom Lane
=?UTF-8?B?SmFrYSBKYW7EjWFy?=  writes:
> I wrote a Postgres client and in it I allow the user to specify arbitrary
> StartupMessage parameters (Map). This is convenient because
> the user can for example set search_path without issuing a separate SET
> query or encoding things into the "options" parameter. The protocol
> documentation also says that the latter is deprecated and what I'm doing
> (if I understand it right) is preferred.

Sure.

> A fellow author of a driver for a different language reminds me that libpq
> explicitly enumerates the supported parameters in the docs, and I checked
> the code, and indeed there is a whitelist and others are rejected.

Not sure what you're looking at, but the issue for libpq is that the set
of "options" that it accepts in connection strings is independent of the
set of backend GUC names (and relatively few of them actually correspond
directly to backend GUCs, either).  I suppose we could make it pass
through unrecognized options, but that would be an unmaintainable mess,
because both sets of names are constantly evolving.

It's a bit of a hack that the backend accepts GUC names directly in
startup messages, but the set of "fixed" parameter names in that context
is very short and has barely changed in decades, so we haven't had
conflict problems.

> technically, he's correct: it's nowhere documented that sending e.g.
> search_path in StartupMessage parameters will work, and for that matter
> whether everything that you can set using SET you can also send there.

protocol.sgml saith (under Message Formats)

In addition to the above, other parameters may be listed. Parameter
names beginning with _pq_. are reserved for use as protocol
extensions, while others are treated as run-time parameters to be set
at backend start time. Such settings will be applied during backend
start (after parsing the command-line arguments if any) and will act
as session defaults.

Admittedly, that doesn't directly define what it means by "run-time
parameter", but what it means is any settable GUC.

regards, tom lane




Re: StartupMessage parameters - free-form or not?

2020-07-11 Thread Jaka Jančar
Excellent, thanks!



Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tom Lane
Tomas Vondra  writes:
> I don't know, but one of the main arguments against simply suggesting
> people to bump up work_mem (if they're hit by the hashagg spill in v13)
> was that it'd increase overall memory usage for them. It seems strange
> to then propose a new GUC set to a default that would result in higher
> memory usage *for everyone*.

It seems like a lot of the disagreement here is focused on Peter's
proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
seem to me that that's a critical element of the proposal.  Why not just
make it default to 1.0, thus keeping the default behavior identical
to what it is now?

If we find that's a poor default, we can always change it later;
but it seems to me that the evidence for a higher default is
a bit thin at this point.

regards, tom lane




Re: proposal: possibility to read dumped table's name from file

2020-07-11 Thread vignesh C
On Mon, Jul 6, 2020 at 10:05 AM Pavel Stehule  wrote:
>
> here is support for comment's line - first char should be #
>

Few comments:
+   str = fgets(*lineptr + total_chars,
+   *n - total_chars,
+   fp);
+
+   if (ferror(fp))
+   return -1;

Should we include any error message in the above case?

+   else
+   break;
+   }
+
+   if (ferror(fp))
+   return -1;

Similar to above.

+   /* check, if there is good enough space for next content */
+   if (*n - total_chars < 2)
+   {
+   *n += 1024;
+   *lineptr = pg_realloc(*lineptr, *n);
+   }
We could use a macro in place of 1024.
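
For example, something along these lines (the constant name is just
illustrative):

    #define FILTER_ALLOC_CHUNK 1024

    ...

        /* grow the line buffer in fixed-size chunks */
        if (*n - total_chars < 2)
        {
            *n += FILTER_ALLOC_CHUNK;
            *lineptr = pg_realloc(*lineptr, *n);
        }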

+   if (objecttype == 't')
+   {
+       if (is_include)
+       {
+           simple_string_list_append(&table_include_patterns, objectname);
+           dopt.include_everything = false;
+       }
+       else
+           simple_string_list_append(&table_exclude_patterns, objectname);
+   }
+   else if (objecttype == 'n')
+   {
+       if (is_include)
+       {
+           simple_string_list_append(&schema_include_patterns, objectname);
+           dopt.include_everything = false;
+       }
+       else
+           simple_string_list_append(&schema_exclude_patterns, objectname);
+   }
Some of the above code is repetitive; can the common code be
made into a macro and called?
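
For instance, a small helper could fold the four branches into one place (a
sketch only; the function name is just illustrative):

    static void
    append_filter_pattern(SimpleStringList *include_list,
                          SimpleStringList *exclude_list,
                          bool is_include, const char *objectname,
                          DumpOptions *dopt)
    {
        if (is_include)
        {
            simple_string_list_append(include_list, objectname);
            dopt->include_everything = false;
        }
        else
            simple_string_list_append(exclude_list, objectname);
    }

and then the caller becomes:

        if (objecttype == 't')
            append_filter_pattern(&table_include_patterns,
                                  &table_exclude_patterns,
                                  is_include, objectname, &dopt);
        else if (objecttype == 'n')
            append_filter_pattern(&schema_include_patterns,
                                  &schema_exclude_patterns,
                                  is_include, objectname, &dopt);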

printf(_("  --extra-float-digits=NUM    override default setting for extra_float_digits\n"));
+   printf(_("  --filter=FILENAME           read object name filter expressions from file\n"));
printf(_("  --if-exists                 use IF EXISTS when dropping objects\n"));
Can this be changed to dump objects and data based on the filter
expressions from the filter file.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Peter Geoghegan
I would be okay with a default of 1.0.

Peter Geoghegan
(Sent from my phone)


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Thomas Munro
On Sun, Jul 12, 2020 at 2:27 AM Tom Lane  wrote:
> David Rowley  writes:
> > hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented
> > it differently.  The problem is made worse by the fact that we'll only
> > release the memory for the hash table during ExecEndHashJoin(). If the
> > planner had some ability to provide the executor with knowledge that
> > the node would never be rescanned, then the executor could release the
> > memory for the hash table after the join is complete.
>
> EXEC_FLAG_REWIND seems to fit the bill already?

FWIW I have a patch that does exactly that, which I was planning to
submit for CF2 along with some other patches that estimate and measure
peak executor memory usage.
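
For what it's worth, the mechanism Tom points at is just the eflags check the
executor already provides; a rough sketch (not the actual patch):

    /* if no rescan is expected, the hash table could be freed as soon as
     * the join has returned its last tuple */
    bool    can_free_early = (eflags & EXEC_FLAG_REWIND) == 0;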




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread David G. Johnston
On Sat, Jul 11, 2020 at 5:47 PM Tom Lane  wrote:

> Tomas Vondra  writes:
> > I don't know, but one of the main arguments against simply suggesting
> > people to bump up work_mem (if they're hit by the hashagg spill in v13)
> > was that it'd increase overall memory usage for them. It seems strange
> > to then propose a new GUC set to a default that would result in higher
> > memory usage *for everyone*.
>
> It seems like a lot of the disagreement here is focused on Peter's
> proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
> seem to me that that's a critical element of the proposal.  Why not just
> make it default to 1.0, thus keeping the default behavior identical
> to what it is now?
>

If we don't default it to something other than 1.0 we might as well just
make it memory units and let people decide precisely what they want to use
instead of adding the complexity of a multiplier.


> If we find that's a poor default, we can always change it later;
> but it seems to me that the evidence for a higher default is
> a bit thin at this point.
>

So "your default is 1.0 unless you installed the new database on or after
13.4 in which case it's 2.0"?

I'd rather have it be just memory units defaulting to -1 meaning "use
work_mem".  In the unlikely scenario we decide post-release to want a
multiplier > 1.0 we can add the GUC with that default at that point.  The
multiplier would want to be ignored if hash_mem is set to anything other
than -1.

David J.


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tom Lane
"David G. Johnston"  writes:
> On Sat, Jul 11, 2020 at 5:47 PM Tom Lane  wrote:
>> It seems like a lot of the disagreement here is focused on Peter's
>> proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
>> seem to me that that's a critical element of the proposal.  Why not just
>> make it default to 1.0, thus keeping the default behavior identical
>> to what it is now?

> If we don't default it to something other than 1.0 we might as well just
> make it memory units and let people decide precisely what they want to use
> instead of adding the complexity of a multiplier.

Not sure how that follows?  The advantage of a multiplier is that it
tracks whatever people might do to work_mem automatically.  In general
I'd view work_mem as the base value that people twiddle to control
executor memory consumption.  Having to also twiddle this other value
doesn't seem especially user-friendly.
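
To make the arithmetic concrete, the idea amounts to deriving the hash limit
from work_mem at runtime, roughly like this (illustrative only, not actual
code; work_mem is in kilobytes):

    /* hash-based nodes get work_mem scaled by the multiplier */
    size_t  hash_mem_limit = (size_t) (work_mem * 1024.0 * hash_mem_multiplier);

so anyone who tunes work_mem automatically moves the hash limit along with it.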

>> If we find that's a poor default, we can always change it later;
>> but it seems to me that the evidence for a higher default is
>> a bit thin at this point.

> So "your default is 1.0 unless you installed the new database on or after
> 13.4 in which case it's 2.0"?

What else would be new?  See e.g. 848ae330a.  (Note I'm not suggesting
that we'd change it in a minor release.)

regards, tom lane




Re: ALTER TABLE validate foreign key dependency problem

2020-07-11 Thread David Rowley
On Thu, 9 Jul 2020 at 15:54, David Rowley  wrote:
> I think the fix is just to delay the foreign key validation when
> there's a rewrite pending until the rewrite is complete.

I looked over this again and only slightly reworded a comment.  The
problem exists as far back as 9.5 so I've attached 3 patches that,
pending any objections, I plan to push about 24 hours from now.

> I also considered that we could just delay all foreign key validations
> until phase 3, but I ended up just doing that only when a rewrite is
> pending.

I still wonder if it's best to delay the validation of the foreign key
regardless of if there's a pending table rewrite, but the patch as it
is now only delays if there's a pending rewrite.

David


delay_altertable_foreignkey_validation_during_table_rewrite10_v2.patch
Description: Binary data


delay_altertable_foreignkey_validation_during_table_rewrite95_v2.patch
Description: Binary data


delay_altertable_foreignkey_validation_during_table_rewrite12_v2.patch
Description: Binary data


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread David G. Johnston
On Saturday, July 11, 2020, Tom Lane  wrote:

> "David G. Johnston"  writes:
> > On Sat, Jul 11, 2020 at 5:47 PM Tom Lane  wrote:
> >> It seems like a lot of the disagreement here is focused on Peter's
> >> proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
> >> seem to me that that's a critical element of the proposal.  Why not just
> >> make it default to 1.0, thus keeping the default behavior identical
> >> to what it is now?
>
> > If we don't default it to something other than 1.0 we might as well just
> > make it memory units and let people decide precisely what they want to
> use
> > instead of adding the complexity of a multiplier.
>
> Not sure how that follows?  The advantage of a multiplier is that it
> tracks whatever people might do to work_mem automatically.


>
I was thinking that setting -1 would basically do that.


>   In general
> I'd view work_mem as the base value that people twiddle to control
> executor memory consumption.  Having to also twiddle this other value
> doesn't seem especially user-friendly.


I’ll admit I don’t have a feel for what is or is not user-friendly when
setting these GUCs in a session to override the global defaults.  But as
far as the global defaults go, I say it’s a wash between (32mb, -1) -> (32mb,
48mb) and (32mb, 1.0) -> (32mb, 1.5).

If you want 96mb for the session/query hash, setting it to 96mb is
invariant, while setting it to 3.0 means it can change in the future if the
system work_mem changes.  Knowing the multiplier is 1.5 and choosing 64mb
for work_mem in the session is possible but also mutable and has
side-effects.  If the user is going to set both values to make it invariant
we are back to it being a wash.

I don’t believe using a multiplier will promote better comprehension for
why this setting exists compared to “-1 means use work_mem but you can
override a subset if you want.”

Is having a session level memory setting be mutable something we want to
introduce?

Is it more user-friendly?

> >> If we find that's a poor default, we can always change it later;
> >> but it seems to me that the evidence for a higher default is
> >> a bit thin at this point.
>
> > So "your default is 1.0 unless you installed the new database on or after
> > 13.4 in which case it's 2.0"?
>
> What else would be new?  See e.g. 848ae330a.  (Note I'm not suggesting
> that we'd change it in a minor release.)
>

A minor release update is what I had thought, and to an extent that was made
possible by not using the multiplier upfront.

I agree options are wide open come v14 and beyond.

David J.


Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-07-11 Thread Julien Rouhaud
Hi,

Currently, getTableAttrs() always retrieves info about column defaults and
check constraints, while this will never be used if the --data-only option is
used.  This seems like a waste of resources, so here's a patch to skip those
parts when the DDL won't be generated.
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e758b5c50a..e4ea604dfe 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8590,7 +8590,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
/*
 * Get info about column defaults
 */
-   if (hasdefaults)
+   if (!dopt->dataOnly && hasdefaults)
{
AttrDefInfo *attrdefs;
int numDefaults;
@@ -8677,7 +8677,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
/*
 * Get info about table CHECK constraints
 */
-   if (tbinfo->ncheck > 0)
+   if (!dopt->dataOnly && tbinfo->ncheck > 0)
{
ConstraintInfo *constrs;
int numConstrs;