Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 5, 2017 at 8:49 PM, Thomas Munro wrote:
> On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila wrote:
>> I have another observation in a somewhat related area. From the code, it looks like we might have some problem with displaying sort info for workers for rescans. I think the problem with the sort info is that it initializes shared info with local memory in ExecSortRetrieveInstrumentation, after which it won't be able to access the values in shared memory changed by workers in rescans. We might be able to fix it by having some local_info, same as shared_info, in the sort node. But the main problem is how we accumulate stats for workers across rescans. The type of sort method can change across rescans. We might be able to accumulate the memory size, though I am not sure if that is right. Although this appears to be somewhat related to the problem being discussed in this thread, it can be dealt with separately if we want to fix it.
>
> Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for each loop, and after the first loop we've lost track of the pointer into shared memory because we replaced it with a palloc'd copy. We could do what you said, or we could reinstate the pointer into the DSM in ExecSortReInitializeDSM() by looking it up in the TOC.

Or would it be an option to change the time ExecXXXRetrieveInstrumentation() is called so it is run only once?

-- Thomas Munro http://www.enterprisedb.com
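For concreteness, a rough sketch of the second option (reinstating the pointer in ExecSortReInitializeDSM() by looking it up in the TOC) might look like the following. It assumes the shared instrumentation area was inserted into the TOC under the node's plan_node_id, the way ExecSortInitializeDSM() sets it up; this is only an illustration, not a tested patch:

    void
    ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt)
    {
        /*
         * After the first loop the leader's shared_info points at a palloc'd
         * copy, so re-fetch the pointer into the DSM segment from the TOC.
         */
        node->shared_info = shm_toc_lookup(pcxt->toc,
                                           node->ss.ps.plan->plan_node_id,
                                           true);

        /* Reset the per-worker counters for the next loop. */
        if (node->shared_info != NULL)
            memset(node->shared_info->sinstrument, 0,
                   node->shared_info->num_workers *
                   sizeof(TuplesortInstrumentation));
    }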
Re: [HACKERS] logical decoding of two-phase transactions
>> - Modified HeapTupleSatisfiesVacuum to return HEAPTUPLE_RECENTLY_DEAD if the transaction id is newer than OldestXmin. Doing this only for CATALOG tables (htup->t_tableOid < (Oid) FirstNormalObjectId).
>
> Because logical decoding supports user-catalog relations, we need to use the same sort of logic that GetOldestXmin uses instead of a simple oid-range check. See RelationIsAccessibleInLogicalDecoding() and the user_catalog_table reloption.

Unfortunately, HeapTupleSatisfiesVacuum does not have the Relation structure handily available to allow for these checks.

> Otherwise pseudo-catalogs used by logical decoding output plugins could still suffer issues with needed tuples getting vacuumed, though only if the txn being decoded made changes to those tables and then ROLLBACKed. It's a pretty tiny corner case for decoding of 2pc but a bigger one when we're addressing streaming decoding.

We disallow rewrites on user_catalog_tables, so they cannot change underneath. Yes, DML can be carried out on them inside a 2PC transaction which then gets ROLLBACK'ed. But if it's getting aborted, then we are not interested in that data anyway. Also, now that we have the "filter_decode_txn_cb_wrapper()" function, we will stop decoding by the next change record cycle because of the abort. So, I am not sure if we need to track user_catalog_tables in HeapTupleSatisfiesVacuum explicitly.

> Otherwise I'm really, really happy with how this is progressing and want to find time to play with it.

Yeah, I will do some more testing and add a few more test cases in the test_decoding plugin. It might be handy to have a DELAY of a few seconds after every change record processing, for example. That way, we can have a TAP test that does a few WAL activities and then introduces a concurrent rollback from another session in the middle of that delayed processing. I have done debugger-based testing of this concurrent rollback functionality as of now.

Another test (actually, functionality) that might come in handy is to have a way for DDL to be actually carried out on the subscriber. We will need something like pglogical.replicate_ddl_command to be added to the core for this to work. We can add this functionality as a follow-on separate patch after discussing how we want to implement that in core.

Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: Usage of epoch in txid_current
Hi!

On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila wrote:
> Currently, txid_current and friends export a 64-bit format of transaction id that is extended with an “epoch” counter so that it will not wrap around during the life of an installation. The epoch value it uses is based on the epoch that is maintained by checkpoint (aka only checkpoint increments it).
>
> Now if epoch changes multiple times between two checkpoints (practically the chances of this are bleak, but there is a theoretical possibility), then won't the computation of xids go wrong? Basically, it can give the same value of txid after wraparound if the checkpoint doesn't occur between the two calls to txid_current.

AFAICS, yes, if epoch changes multiple times between two checkpoints, then the computation will go wrong. And it doesn't look like a purely theoretical possibility to me, because I think I know of a couple of instances on the edge of this...

> Am I missing something which ensures that epoch gets incremented at or after wraparound?

I've checked the code, and it doesn't look to me like there is anything like this.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 5 December 2017 at 16:00, Nikhil Sontakke wrote:
> We disallow rewrites on user_catalog_tables, so they cannot change underneath. Yes, DML can be carried out on them inside a 2PC transaction which then gets ROLLBACK'ed. But if it's getting aborted, then we are not interested in that data anyway. Also, now that we have the "filter_decode_txn_cb_wrapper()" function, we will stop decoding by the next change record cycle because of the abort.
>
> So, I am not sure if we need to track user_catalog_tables in HeapTupleSatisfiesVacuum explicitly.

I guess it's down to whether, when we're decoding a txn that just got concurrently aborted, the output plugin might do anything with its user catalogs that could cause a crash. Output plugins are most likely to be using the genam (or even SPI, I guess?) to read user catalogs during logical decoding. Logical decoding itself does not rely on the correctness of user catalogs in any way; it's only a concern for output plugin callbacks. It may make sense to kick this one down the road at this point, as I can't conclusively see where it'd cause an actual problem.

>> Otherwise I'm really, really happy with how this is progressing and want to find time to play with it.
>
> Yeah, I will do some more testing and add a few more test cases in the test_decoding plugin. It might be handy to have a DELAY of a few seconds after every change record processing, for example. That way, we can have a TAP test that does a few WAL activities and then introduces a concurrent rollback from another session in the middle of that delayed processing. I have done debugger-based testing of this concurrent rollback functionality as of now.

Sounds good.

> Another test (actually, functionality) that might come in handy is to have a way for DDL to be actually carried out on the subscriber. We will need something like pglogical.replicate_ddl_command to be added to the core for this to work. We can add this functionality as a follow-on separate patch after discussing how we want to implement that in core.

Yeah, definitely a different patch, but assuredly valuable.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pow support for pgbench
Hello Tom,

1. A patch that adds an integer version of pow() but not a double version
2. A patch that adds a double version of pow() but not an integer version
3. A patch that adds both an integer version of pow() and a double version of pow(), with the two versions having different names

It seems to me that 1 and 2 have value on their own for the workloads being emulated, so what you are suggesting in 3 looks good to me. Now why are two different function names necessary?

ISTM one key issue here is whether pgbench's expression language is meant to model SQL (where we have function overloading) or C (where there is no overloading). I don't think we've really settled on a fixed policy on that, but maybe now is the time.

Function overloading is implemented for ABS (as noted by Michaël), but also LEAST and GREATEST. Typing is dynamic, based on guessing (think "pgbench -D i=1 -D f=1.0"). These decisions have already been taken, they were reasonable, and the behavior is consistent and useful. I do not see the point of going backward and breaking compatibility. The idea is to model after SQL (eg I've also been asked to do that for the operators & functions extensions, lost in the CF queue), when possible. When possible is not always.

If we do think that function overloading is OK, there remains the question of when the typing is resolved.

Dynamic guessing is the only pragmatic option with pgbench.

I think Robert is objecting to resolving at runtime, and I tend to agree that that's something we'd regret in the long run.

Too late. I do not think that we would come to regret it. I'm rather regretting that pgbench capabilities are improving so slowly.

It doesn't match either SQL or C.

Sure. A dynamically typed language cannot match a statically typed one. I do not see this as a significant issue for writing benchmarking scripts.

Now, about POW: As for approximating NUMERIC, there is a first difficulty, as the type can either be approximated as an INT or a DOUBLE. Ah ah. Also, POW raises another difficulty, which is that depending on the sign of the second argument the result is more INT or more DOUBLE. Fun. So even if we do separate functions (IPOW & DPOW, or POW & DPOW), the result cannot be typed statically. Note that the type can be enforced if necessary thanks to the int() and double() functions, which is enough of a workaround for pgbench's simple purpose.

In this context, ISTM that Raul's patch is a reasonable, even if imperfect, compromise. I can understand that this compromise does not suit Robert. Too bad, because both POW versions (int & double) have a reasonable use case for writing benchmarks. Using two names does not resolve the typing issue anyway.

None of the 3 options offered by Robert are really satisfactory, and the compromise is also rejected. So basically do whatever you want with the patch (accept, reject, 1, 2, 3). Only "Returned with feedback" with the feedback being "please implement a statically typed pgbench" does not seem like a useful option. I can only say that these functions would be useful, so for me it is better in than out, but that is just my silly opinion.

-- Fabien.
Re: compress method for spgist - 2
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: tested, passed

I've read the updated patch and see my concerns addressed. I'm looking forward to SP-GiST compress method support, as it will allow usage of SP-GiST index infrastructure for PostGIS.

The new status of this patch is: Ready for Committer
Re: Usage of epoch in txid_current
On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov wrote:
> On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila wrote:
>> Currently, txid_current and friends export a 64-bit format of transaction id that is extended with an “epoch” counter so that it will not wrap around during the life of an installation. The epoch value it uses is based on the epoch that is maintained by checkpoint (aka only checkpoint increments it).
>>
>> Now if epoch changes multiple times between two checkpoints (practically the chances of this are bleak, but there is a theoretical possibility), then won't the computation of xids go wrong? Basically, it can give the same value of txid after wraparound if the checkpoint doesn't occur between the two calls to txid_current.
>
> AFAICS, yes, if epoch changes multiple times between two checkpoints, then the computation will go wrong. And it doesn't look like a purely theoretical possibility to me, because I think I know of a couple of instances on the edge of this...

Okay, it is quite strange that we haven't discovered this problem till now. I think we should do something to fix it. One idea is that we track the epoch change in shared memory (probably in the same data structure (VariableCacheData) where we track nextXid). We need to increment it when the xid wraps around during xid allocation (in GetNewTransactionId). Also, we need to make it persistent, which means we need to log it in the checkpoint xlog record and write a separate xlog record for the epoch change. Thoughts?

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
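To make that idea concrete, here is a hypothetical sketch of the wraparound detection in GetNewTransactionId(). The xidEpoch field and the exact placement are assumptions made up for illustration; only nextXid, TransactionIdAdvance() and XidGenLock are existing pieces:

    TransactionId xid;

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

    xid = ShmemVariableCache->nextXid;

    /* advance nextXid, skipping the special xids on wraparound */
    TransactionIdAdvance(ShmemVariableCache->nextXid);

    /* if the counter wrapped, remember that immediately and durably */
    if (ShmemVariableCache->nextXid < xid)
    {
        ShmemVariableCache->xidEpoch++;     /* assumed new field */
        /* ... WAL-log the new epoch here so it survives a crash ... */
    }

    LWLockRelease(XidGenLock);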
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 5, 2017 at 1:29 PM, Thomas Munro wrote:
> On Tue, Dec 5, 2017 at 8:49 PM, Thomas Munro wrote:
>> On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila wrote:
>>> I have another observation in a somewhat related area. From the code, it looks like we might have some problem with displaying sort info for workers for rescans. I think the problem with the sort info is that it initializes shared info with local memory in ExecSortRetrieveInstrumentation, after which it won't be able to access the values in shared memory changed by workers in rescans. We might be able to fix it by having some local_info, same as shared_info, in the sort node. But the main problem is how we accumulate stats for workers across rescans. The type of sort method can change across rescans. We might be able to accumulate the memory size, though I am not sure if that is right. Although this appears to be somewhat related to the problem being discussed in this thread, it can be dealt with separately if we want to fix it.
>>
>> Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for each loop, and after the first loop we've lost track of the pointer into shared memory because we replaced it with a palloc'd copy. We could do what you said, or we could reinstate the pointer into the DSM in ExecSortReInitializeDSM() by looking it up in the TOC.
>
> Or would it be an option to change the time ExecXXXRetrieveInstrumentation() is called so it is run only once?

To me, that doesn't sound like a bad option. I think if we do so, then we don't even need to reinitialize the shared sort stats. I think something like the attached should work if we want to go this route. We can add a regression test if this is what we think is a good idea.

Having said that, one problem I see with doing things this way is that in general we will display the accumulated stats of each worker, but for sort or some other special nodes (like hash), we will show the information of only the last loop. I am not sure if that is a matter of concern, but if we want to do it this way, then probably we should explain this in the documentation as well.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

fix_accum_instr_parallel_workers_v4.patch
Re: [HACKERS] kqueue
On Thu, Jun 22, 2017 at 7:19 PM, Thomas Munro wrote:
> I don't plan to resubmit this patch myself, but I was doing some spring cleaning and rebasing today and I figured it might be worth quietly leaving a working patch here just in case anyone from the various BSD communities is interested in taking the idea further.

Since there was a mention of kqueue on -hackers today, here's another rebase. I got curious just now and ran a very quick test on an AWS 64 vCPU m4.16xlarge instance running image "FreeBSD 11.1-STABLE-amd64-2017-08-08 - ami-00608178". I set shared_buffers = 10GB and ran pgbench approximately the same way Heikki and Keith did upthread:

  pgbench -i -s 200 postgres
  pgbench -M prepared -j 6 -c 6 -S postgres -T60 -P1
  pgbench -M prepared -j 12 -c 12 -S postgres -T60 -P1
  pgbench -M prepared -j 24 -c 24 -S postgres -T60 -P1
  pgbench -M prepared -j 36 -c 36 -S postgres -T60 -P1
  pgbench -M prepared -j 48 -c 48 -S postgres -T60 -P1

The TPS numbers I got (including connections establishing) were:

  clients   master    patched
  6         146,215   147,535  (+0.9%)
  12        273,056   280,505  (+2.7%)
  24        360,751   369,965  (+2.5%)
  36        413,147   420,769  (+1.8%)
  48        416,189   444,537  (+6.8%)

The patch appears to be doing something positive on this particular system and that effect was stable over a few runs.

-- Thomas Munro http://www.enterprisedb.com

kqueue-v8.patch
Re: es_query_dsa is broken
On Tue, Dec 5, 2017 at 6:45 AM, Thomas Munro wrote:
> On Thu, Nov 30, 2017 at 11:19 PM, Thomas Munro wrote:
>> On Thu, Nov 30, 2017 at 4:01 AM, Robert Haas wrote:
>>> In v10 we might need to go with a solution like what you've sketched here, because Tom will complain about breaking binary compatibility with EState (and maybe other PlanState nodes) in a released branch.
>
> Here's a pair of versions of that patch tidied up a bit, for REL_10_STABLE and master. Unfortunately I've been unable to come up with a reproducer that shows misbehaviour (something like what was reported on -bugs[1] which appears to be a case of calling dsa_get_address(wrong_area, some_dsa_pointer)).

+   EState     *estate = gatherstate->ps.state;
+
+   /* Install our DSA area while executing the plan. */
+   estate->es_query_dsa = gatherstate->pei->area;
    outerTupleSlot = ExecProcNode(outerPlan);
+   estate->es_query_dsa = NULL;

Won't the above coding pattern create a problem if ExecProcNode throws an error and an outer block catches it and continues execution (consider the case of execution inside PL blocks)?

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
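To illustrate the concern, one way the assignment could be made error-safe is with the usual PG_TRY() machinery, as in the hedged sketch below (reusing the variables from the hunk above). This is just one possible shape, not the fix that was actually adopted:

    estate->es_query_dsa = gatherstate->pei->area;
    PG_TRY();
    {
        outerTupleSlot = ExecProcNode(outerPlan);
    }
    PG_CATCH();
    {
        /* don't leave a dangling pointer behind if the plan errors out */
        estate->es_query_dsa = NULL;
        PG_RE_THROW();
    }
    PG_END_TRY();
    estate->es_query_dsa = NULL;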
Re: [HACKERS] pow support for pgbench
Hi all,

I've been giving this some thought and I think we could reach the compromise of having a single function with 2 overloads:

* pow(double, double) -> double: Uses C pow().
* pow(int, int) -> double: Uses ipow() for positive exponents, and pow() for negative exponents.

In both cases we'd return a double, but we use the fast ipow() when possible (which can be 20x faster), so at the cost of an extra cast if you need an int, we'd have a consistent API. Would this be acceptable?
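For reference, the integer path being discussed is ordinary exponentiation by squaring. A minimal sketch of such an ipow() is shown below; it illustrates the idea rather than the exact code in the proposed patch, and it ignores overflow of the int64 result:

    /* integer power by repeated squaring; exp must be non-negative */
    static int64
    ipow(int64 base, int64 exp)
    {
        int64   result = 1;

        while (exp > 0)
        {
            if (exp & 1)
                result *= base;
            base *= base;
            exp >>= 1;
        }
        return result;
    }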
Re: [HACKERS] pow support for pgbench
> I've been giving this some thought and I think we could reach the compromise of having a single function with 2 overloads:
>
> * pow(double, double) -> double: Uses C pow().
> * pow(int, int) -> double: Uses ipow() for positive exponents, and pow() for negative exponents.
>
> In both cases we'd return a double, but we use the fast ipow() when possible (which can be 20x faster), so at the cost of an extra cast if you need an int, we'd have a consistent API. Would this be acceptable?

This is for Robert to say whether it is more acceptable to him. My 0.02€: ISTM that it is closely equivalent to having just the double version and using an explicit cast to get an int if needed, which does not conform to strict SQL behavior any more than the previous compromise did. Also, probably having something (anything) is better than nothing.

-- Fabien.
Re: [HACKERS] [POC] Faster processing at Gather node
On Mon, Dec 4, 2017 at 9:20 PM, Robert Haas wrote:
> On Sun, Dec 3, 2017 at 10:30 PM, Amit Kapila wrote:
>> I thought there are some cases (though less) where we want to Shutdown the nodes (ExecShutdownNode) earlier and release the resources sooner. However, if you are not completely sure about this change, then we can leave it as it is. Thanks for sharing your thoughts.
>
> OK, thanks. I committed that patch, after first running 100 million tuples through a Gather over and over again to test for leaks. Hopefully I haven't missed anything here, but it looks like it's solid. Here once again are the remaining patches. While the already-committed patches are nice, these two are the ones that

Hi, I spent some time verifying this memory-leak patch for the gather-merge case and I too found it good. In the query I tried, around 10 million tuples were passed through gather-merge. On analysing the output of top, the memory usage looks acceptable and it gets freed once the query is completed. Since I was trying on my local system only, I tested for up to 8 workers and didn't find any memory leaks for the queries I tried. The test case is in the attached file.

-- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/

ml_test_query.sql
Speeding up pg_upgrade
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about zero-downtime upgrades. After the usual discussion of using logical replication, Slony, and perhaps having the server be able to read old and new system catalogs, we discussed speeding up pg_upgrade.

There are clusters that take a long time to dump the schema from the old cluster and recreate it in the new cluster. One idea of speeding up pg_upgrade would be to allow pg_upgrade to be run in two stages:

1. prevent system catalog changes while the old cluster is running, and dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files

My question is whether the schema dump/restore is time-consuming enough to warrant this optional more complex API, and whether people would support adding a server setting that prevented all system table changes?

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Speeding up pg_upgrade
Bruce, * Bruce Momjian (br...@momjian.us) wrote: > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > zero-downtime upgrades. After the usual discussion of using logical > replication, Slony, and perhaps having the server be able to read old > and new system catalogs, we discussed speeding up pg_upgrade. Sounds familiar. > There are clusters that take a long time to dump the schema from the old > cluster and recreate it in the new cluster. One idea of speeding up > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > 1. prevent system catalog changes while the old cluster is running, and > dump the old cluster's schema and restore it in the new cluster > > 2. shut down the old cluster and copy/link the data files Perhaps a bit more complicated, but couldn't we copy/link while the old cluster is online and in backup mode, finish backup mode, shut down the old cluster, and then play forward the WAL to catch any relation extents being added or similar, and then flip to the new PG version? > My question is whether the schema dump/restore is time-consuming enough > to warrant this optional more complex API, and whether people would > support adding a server setting that prevented all system table changes? When you say 'system table changes', you're referring to basically all DDL, right? Just wish to clarify as there might be some confusion between the terminology you're using here and allow_system_table_mods. Would we need to have autovacuum shut down too..? The other concern is if there's changes made to the catalogs by non-DDL activity that needs to be addressed too (logical replication?); nothing definite springs to mind off-hand for me, but perhaps others will think of things. Thanks! Stephen
Re: Speeding up pg_upgrade
Hi On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian wrote: > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > zero-downtime upgrades. After the usual discussion of using logical > replication, Slony, and perhaps having the server be able to read old > and new system catalogs, we discussed speeding up pg_upgrade. > > There are clusters that take a long time to dump the schema from the old > cluster and recreate it in the new cluster. One idea of speeding up > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > 1. prevent system catalog changes while the old cluster is running, and > dump the old cluster's schema and restore it in the new cluster > > 2. shut down the old cluster and copy/link the data files > When we were discussing this, I was thinking that the linking could be done in phase 1 too, as that's potentially slow on a very large schema. > > My question is whether the schema dump/restore is time-consuming enough > to warrant this optional more complex API, and whether people would > support adding a server setting that prevented all system table changes? I've certainly heard of cases where pg_upgrade takes significant amounts of time to run on very complex databases. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Speeding up pg_upgrade
On Tue, Dec 5, 2017 at 09:16:02AM -0500, Stephen Frost wrote: > Bruce, > > * Bruce Momjian (br...@momjian.us) wrote: > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > > zero-downtime upgrades. After the usual discussion of using logical > > replication, Slony, and perhaps having the server be able to read old > > and new system catalogs, we discussed speeding up pg_upgrade. > > Sounds familiar. Yeah. :-| > > There are clusters that take a long time to dump the schema from the old > > cluster and recreate it in the new cluster. One idea of speeding up > > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > > > 1. prevent system catalog changes while the old cluster is running, and > > dump the old cluster's schema and restore it in the new cluster > > > > 2. shut down the old cluster and copy/link the data files > > Perhaps a bit more complicated, but couldn't we copy/link while the > old cluster is online and in backup mode, finish backup mode, shut down > the old cluster, and then play forward the WAL to catch any relation > extents being added or similar, and then flip to the new PG version? Well, that would require reading the old WAL, which would add an additional compatibility requirement that seems unwise. > > My question is whether the schema dump/restore is time-consuming enough > > to warrant this optional more complex API, and whether people would > > support adding a server setting that prevented all system table changes? > > When you say 'system table changes', you're referring to basically all > DDL, right? Just wish to clarify as there might be some confusion > between the terminology you're using here and allow_system_table_mods. Not only all DDL, but even updating them for the internal stuff, like pg_class.relfrozenxid. > Would we need to have autovacuum shut down too..? Yes. > The other concern is if there's changes made to the catalogs by non-DDL > activity that needs to be addressed too (logical replication?); nothing > definite springs to mind off-hand for me, but perhaps others will think > of things. Yes, it could extend to many parts of the system, which is why I am asking if it is worth it. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Speeding up pg_upgrade
On Tue, Dec 5, 2017 at 11:16:26PM +0900, Dave Page wrote: > Hi > > On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian wrote: > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > zero-downtime upgrades. After the usual discussion of using logical > replication, Slony, and perhaps having the server be able to read old > and new system catalogs, we discussed speeding up pg_upgrade. > > There are clusters that take a long time to dump the schema from the old > cluster and recreate it in the new cluster. One idea of speeding up > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > 1. prevent system catalog changes while the old cluster is running, and > dump the old cluster's schema and restore it in the new cluster > > 2. shut down the old cluster and copy/link the data files > > > When we were discussing this, I was thinking that the linking could be done in > phase 1 too, as that's potentially slow on a very large schema. Uh, good point! You can create the hard links while the system is running, no problem! It would only be copy that can't be done while the system is running. Of course a big question is whether hard linking takes any measurable time. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Speeding up pg_upgrade
Dave, * Dave Page (dp...@pgadmin.org) wrote: > On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian wrote: > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > > zero-downtime upgrades. After the usual discussion of using logical > > replication, Slony, and perhaps having the server be able to read old > > and new system catalogs, we discussed speeding up pg_upgrade. > > > > There are clusters that take a long time to dump the schema from the old > > cluster and recreate it in the new cluster. One idea of speeding up > > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > > > 1. prevent system catalog changes while the old cluster is running, and > > dump the old cluster's schema and restore it in the new cluster > > > > 2. shut down the old cluster and copy/link the data files > > When we were discussing this, I was thinking that the linking could be done > in phase 1 too, as that's potentially slow on a very large schema. Right, I had that thought too when first reading this, but the problem there is that new files can show up due to a relation being extended (at least, and perhaps in other cases too..). > > My question is whether the schema dump/restore is time-consuming enough > > to warrant this optional more complex API, and whether people would > > support adding a server setting that prevented all system table changes? > > I've certainly heard of cases where pg_upgrade takes significant amounts of > time to run on very complex databases. Right, but that doesn't really answer the question as to which part of the pg_upgrade process is taking up the time. In any case, of course, if we're able to move part of what pg_upgrade does to be while the old server is online then that takes whatever the cost of that is out of the downtime window. The question is if that's a 5% improvement in the overall performance of pg_upgrade or a 70% one. This will be case-by-case, of course, but if, in the best-case, we only get a 5% improvement then this might not be worth the risk. Thanks! Stephen
Re: Speeding up pg_upgrade
On Tue, Dec 5, 2017 at 09:23:49AM -0500, Stephen Frost wrote: > Dave, > > * Dave Page (dp...@pgadmin.org) wrote: > > On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian wrote: > > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > > > zero-downtime upgrades. After the usual discussion of using logical > > > replication, Slony, and perhaps having the server be able to read old > > > and new system catalogs, we discussed speeding up pg_upgrade. > > > > > > There are clusters that take a long time to dump the schema from the old > > > cluster and recreate it in the new cluster. One idea of speeding up > > > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > > > > > 1. prevent system catalog changes while the old cluster is running, and > > > dump the old cluster's schema and restore it in the new cluster > > > > > > 2. shut down the old cluster and copy/link the data files > > > > When we were discussing this, I was thinking that the linking could be done > > in phase 1 too, as that's potentially slow on a very large schema. > > Right, I had that thought too when first reading this, but the problem > there is that new files can show up due to a relation being extended (at > least, and perhaps in other cases too..). Oh, yikes, yes. > > > My question is whether the schema dump/restore is time-consuming enough > > > to warrant this optional more complex API, and whether people would > > > support adding a server setting that prevented all system table changes? > > > > I've certainly heard of cases where pg_upgrade takes significant amounts of > > time to run on very complex databases. > > Right, but that doesn't really answer the question as to which part of > the pg_upgrade process is taking up the time. > > In any case, of course, if we're able to move part of what pg_upgrade > does to be while the old server is online then that takes whatever the > cost of that is out of the downtime window. The question is if that's a > 5% improvement in the overall performance of pg_upgrade or a 70% one. > This will be case-by-case, of course, but if, in the best-case, we only > get a 5% improvement then this might not be worth the risk. Yes, and who is going to know if they have a setup where the more complex API is worth it? pg_upgrade is already complex enough to use. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Speeding up pg_upgrade
Bruce, * Bruce Momjian (br...@momjian.us) wrote: > On Tue, Dec 5, 2017 at 09:16:02AM -0500, Stephen Frost wrote: > > > There are clusters that take a long time to dump the schema from the old > > > cluster and recreate it in the new cluster. One idea of speeding up > > > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > > > > > 1. prevent system catalog changes while the old cluster is running, and > > > dump the old cluster's schema and restore it in the new cluster > > > > > > 2. shut down the old cluster and copy/link the data files > > > > Perhaps a bit more complicated, but couldn't we copy/link while the > > old cluster is online and in backup mode, finish backup mode, shut down > > the old cluster, and then play forward the WAL to catch any relation > > extents being added or similar, and then flip to the new PG version? > > Well, that would require reading the old WAL, which would add an > additional compatibility requirement that seems unwise. In my proposal, this would be the old version of PG reading the old WAL. Thinking about it a bit further though, I'm not sure it'd end up working in link mode anyway, due to post-backup-finish changes that could be made by the old server on the data files before it's shut down. We have to have a way of dealing with the delta between the hard link trees after the old server is shut down though because there could be new relation extents, at least. > > > My question is whether the schema dump/restore is time-consuming enough > > > to warrant this optional more complex API, and whether people would > > > support adding a server setting that prevented all system table changes? > > > > When you say 'system table changes', you're referring to basically all > > DDL, right? Just wish to clarify as there might be some confusion > > between the terminology you're using here and allow_system_table_mods. > > Not only all DDL, but even updating them for the internal stuff, like > pg_class.relfrozenxid. Good point. We'd really need a pretty bullet-proof way to ensure that the catalog isn't changed during this time period and that seems like it might be difficult without a lot of work. > > Would we need to have autovacuum shut down too..? > > Yes. Ok, makes sense. > > The other concern is if there's changes made to the catalogs by non-DDL > > activity that needs to be addressed too (logical replication?); nothing > > definite springs to mind off-hand for me, but perhaps others will think > > of things. > > Yes, it could extend to many parts of the system, which is why I am > asking if it is worth it. My initial reaction is that it's worth it, but then I also wonder about other issues (having to get an ANALYZE done on the new cluster before opening it up, for example..) and it makes me wonder if perhaps it'll end up being too much risk for too little gain. Thanks! Stephen
Re: Speeding up pg_upgrade
Bruce, * Bruce Momjian (br...@momjian.us) wrote: > > In any case, of course, if we're able to move part of what pg_upgrade > > does to be while the old server is online then that takes whatever the > > cost of that is out of the downtime window. The question is if that's a > > 5% improvement in the overall performance of pg_upgrade or a 70% one. > > This will be case-by-case, of course, but if, in the best-case, we only > > get a 5% improvement then this might not be worth the risk. > > Yes, and who is going to know if they have a setup where the more > complex API is worth it? pg_upgrade is already complex enough to use. Sure, but the solution there is really to make pg_upgrade simpler to use, even as we add these more complicated APIs to it. What that likely means in practical terms is that we have another utility, which uses pg_upgrade underneath, that you're able to configure to know about your existing cluster and the version of PG you want to upgrade to and where you want it and if you want a copy or if hard-links are ok, etc. Having such a tool is actually what I'd been hoping would come out of the documented process for doing a "pg_upgrade" on replicas that's currently in our documentation. That's not happened yet, but it's something that David Steele and I have been chatting about because the procedure in the documentation is terribly difficult and dangerous for those who aren't as familiar with the system. Perhaps we could have one tool that handles both the more complicated pg_upgrade API and deals with upgrading replicas. Alternatively, we could have a config file for pg_upgrade instead which might be a simpler way for people to describe exactly their current configuration and what they'd like to go to. Upgrading replicas involves using something like SSH though.. Thanks! Stephen
Re: simplehash: tb->sizemask = 0
On 11/29/17 13:49, Andres Freund wrote:
> On 2017-11-27 22:53:41 -0500, Tom Lane wrote:
>> Tomas Vondra writes:
>>> I'm a bit puzzled by this code in SH_COMPUTE_PARAMETERS:
>>>
>>>     if (tb->size == SH_MAX_SIZE)
>>>         tb->sizemask = 0;
>>>     else
>>>         tb->sizemask = tb->size - 1;
>>>
>>> Doesn't that mean that with SH_MAX_SIZE we end up with sizemask being 0 (i.e. no bits set)?
>>
>> Yeah, which is very obviously broken: for one thing, the Asserts in SH_NEXT/SH_PREV would surely go off.
>
> That's obviously wrong. Not sure how that happened. I might have had it as a shift at first?
>
>> (Why are those assertions, anyway, and not test-and-elog? I do not think an assertion failure is a suitable way to report "hash table full".)
>
> There's a test and elog during insert. Adding actual branches into SH_NEXT/SH_PREV seems like a bad idea. Will test a fix.

I'll be happy to help test this fix when it's ready.

-- todd
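For what it's worth, one plausible shape for such a fix (an illustration only, not necessarily what gets committed) is to drop the special case and rely on truncation, assuming SH_MAX_SIZE is one past PG_UINT32_MAX so that the cast yields an all-bits-set mask:

    /* size is a power of two no larger than SH_MAX_SIZE; at that maximum,
     * (uint32) (size - 1) is 0xFFFFFFFF, i.e. every bit set */
    tb->size = size;
    tb->sizemask = (uint32) (size - 1);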
Re: Add PGDLLIMPORT lines to some variables
On Mon, Dec 4, 2017 at 8:48 PM, Peter Geoghegan wrote:
> On Mon, Dec 4, 2017 at 5:41 PM, Noah Misch wrote:
>>> I don't think we quite have an established protocol for this. My personal opinion, though I'm biased in this specific case, is that we should adopt a position that PGDLLIMPORTs should basically be backpatched whenever a credible extension even halfway reasonably requires it. There's no easy way to get this done by default, and we're so far unwilling to just slap this onto every variable. So to not further disadvantage people forced to live in the MS environment, that seems the sanest solution. It's not like these are high risk.
>>
>> +1
>
> If that's going to be the policy, then I have some requests of my own. I would like to add some PGDLLIMPORTs to suit the external version of amcheck (the version that targets earlier versions of Postgres). These are:
>
> RecentGlobalXmin -- This is only PGDLLIMPORT on Postgres 10+, following commit 56018bf2. I'd like to get that back to 9.4, although there is no reason to not include 9.3.
>
> TransactionXmin -- This is needed for the newer heap-matches-index verification check. Again, I would like this on 9.4+, but 9.3+ works too.
>
> Note that somebody asked about running it on Windows recently, and on one other occasion in the past. It does come up.

Committed with these additions. Please check that I haven't messed anything up.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Mon, Dec 4, 2017 at 6:57 PM, David Rowley wrote: > On 2 December 2017 at 03:39, Robert Haas wrote: >> On Thu, Nov 30, 2017 at 11:39 PM, David Rowley >> wrote: >>> I feel like we could do better here with little extra effort. The >>> DETACH index feature does not really seem required for this patch. >> >> Because of the dump/restore considerations mentioned in >> http://postgr.es/m/ca+tgmobuhghg9v8saswkhbbfywg5a0qb+jgt0uovq5ycbdu...@mail.gmail.com >> I believe we need a way to create the index on the parent without >> immediately triggering index builds on the children, plus a way to >> create an index on a child after-the-fact and attach it to the parent. >> Detach isn't strictly required, but why support attach and not detach? > > I proposed that this worked a different way in [1]. I think the way I > mention is cleaner as it means there's no extra reason for a > partitioned index to be indisvalid=false than there is for any other > normal index. How does that proposal keep pg_dump from latching onto the wrong index, if there's more than one index on the same columns? > My primary reason for not liking this way is around leaving indexes > around as indisvalid=false, however, I suppose that an index could be > replaced atomically with a DETACH and ATTACH within a single > transaction. I had previously not really liked the idea of > invalidating an index by DETACHing a leaf table's index from it. Of > course, this patch does nothing special with partitioned indexes, but > I believe one day we will want to do something with these and that the > DETACH/ATTACH is not the best design for replacing part of the index. What do you propose? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] postgres_fdw super user checks
On Mon, Dec 4, 2017 at 5:57 PM, Ashutosh Bapat wrote: > I think the real behaviour can be described as something like this: > > "Only superusers may connect to foreign servers without password > authentication, so always specify the password > option for user mappings that may be used by non-superusers." But > which user mappings may be used by non-superusers can not be defined > without explaining views owned by superusers. I don't think we should > be talking about views in that part of documentation. Well, if we don't, then I'm not sure we can really make this clear. Anyhow, I've committed the patch to master for now; we can keep arguing about what, if anything, to do for back-branch documentation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] postgres_fdw super user checks
Robert, Ashutosh, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Dec 4, 2017 at 5:57 PM, Ashutosh Bapat > wrote: > > I think the real behaviour can be described as something like this: > > > > "Only superusers may connect to foreign servers without password > > authentication, so always specify the password > > option for user mappings that may be used by non-superusers." But > > which user mappings may be used by non-superusers can not be defined > > without explaining views owned by superusers. I don't think we should > > be talking about views in that part of documentation. > > Well, if we don't, then I'm not sure we can really make this clear. Yeah, I'm pretty sure we need to spell out the situation around views here because it's different from how views normally work as discussed in Rules and Privileges. I'll note that the Rules and Privileges section could use a bit of love too - the v10 docs have: "Due to rewriting of queries by the PostgreSQL rule system, other tables/views than those used in the original query get accessed. When update rules are used, this can include write access to tables." Which isn't really accurate since simple updatable views were added. Looking at it more though, really, I think that whole page needs to be re-cast to be about *views* and stop talking about rules. That's really a separate discussion to have though. > Anyhow, I've committed the patch to master for now; we can keep > arguing about what, if anything, to do for back-branch documentation. Thanks! Stephen
Re: pg_dumpall -r -c try to drop user postgres
2017-12-03 23:48 GMT+01:00 Michael Paquier:
> On Sun, Dec 3, 2017 at 3:21 PM, Pavel Stehule wrote:
>> I am not sure if user postgres should be removed, so it is probably a bug:
>>
>> pg_dumpall -r -c | grep postgres
>>
>> DROP ROLE postgres;
>> CREATE ROLE postgres;
>
> You are looking for this bit of code:
>
>     /*
>      * If asked to --clean, do that first. We can avoid detailed
>      * dependency analysis because databases never depend on each other,
>      * and tablespaces never depend on each other. Roles could have
>      * grants to each other, but DROP ROLE will clean those up silently.
>      */
>     if (output_clean)
>     {
>         if (!globals_only && !roles_only && !tablespaces_only)
>             dropDBs(conn);
>
>         if (!roles_only && !no_tablespaces)
>             dropTablespaces(conn);
>
>         if (!tablespaces_only)
>             dropRoles(conn);
>     }
>
> Could you clarify what you think is wrong here?

I am not sure if the issue is in this code. But I am sure that DROP ROLE postgres is just nonsense. This command should fail every time, and so it should not be generated.

Regards

Pavel

> --
> Michael
Re: Add PGDLLIMPORT lines to some variables
On Tue, Dec 5, 2017 at 6:49 AM, Robert Haas wrote: >> RecentGlobalXmin -- This is only PGDLLIMPORT on Postgres 10+, >> following commit 56018bf2. I'd like to get that back to 9.4, although >> there is no reason to not include 9.3. >> >> TransactionXmin -- This is needed for the newer heap-matches-index >> verification check. Again, I would like this on 9.4+, but 9.3+ works >> too. >> >> Note that somebody asked about running it on Windows recently, and on >> one other occasion in the past. It does come up. > > Committed with these additions. Please check that I haven't messed anything > up. Thanks, but you modified RecentGlobalDataXmin, not RecentGlobalXmin. Can you fix it, please? -- Peter Geoghegan
Re: Add PGDLLIMPORT lines to some variables
On Tue, Dec 5, 2017 at 9:09 AM, Peter Geoghegan wrote: >> Committed with these additions. Please check that I haven't messed anything >> up. > > Thanks, but you modified RecentGlobalDataXmin, not RecentGlobalXmin. > Can you fix it, please? I was looking at the wrong branch. Forget it. Thanks -- Peter Geoghegan
dsa_allocate could not find 4 free pages
Hackers,

I'm getting the error in $subject and am only posting here because (a) the comments in src/backend/utils/mmgr/dsa.c circa line 720 suggest that this is a bug, and (b) the DSA code appears to be fairly new to the postgresql project, and perhaps not fully debugged. I am running against a build from 291a31c42c893574e9676e00121e6c6915a59de5 REL_10_STABLE, on centos linux.

I'm not asking for help configuring my system. If this is a run of the mill out of memory type error, I'll deal with it myself. But if this indicates a bug, I'll be happy to try to distill what I'm doing to a test case that I can share. (I've only hit this once, and have little idea whether it would be reproducible.) The gist of the query was an array_agg over the results of a parallel query over a partitioned table, executed from within a plpgsql function, like:

SELECT ARRAY_AGG((a, latest, timelogged, b, cnt)::custom_composite_type) AS trans
FROM (SELECT a, b, timelogged,
             MAX(timelogged) OVER (PARTITION BY a) AS latest,
             SUM(cnt) AS cnt
      FROM (SELECT a, b, timelogged, COUNT(*) AS cnt
            FROM mytable
            WHERE a = ANY('{65537,65538,65539,65540,65541,65542,65543,65544}'::OID[])
              AND timelogged >= '2017-12-04 16:12:50-08'::TIMESTAMP
              AND timelogged < '2017-12-04 17:12:50-08'::TIMESTAMP
            GROUP BY a, b, timelogged
           ) AS ss
      GROUP BY a, b, timelogged
     ) AS s

If this is not a bug, but rather something that users can hit through normal resource exhaustion, then perhaps the comments in dsa.c can be changed to make that clear.

Thanks!

mark
Re: Usage of epoch in txid_current
On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:
> On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov wrote:
>> On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila wrote:
>>> Currently, txid_current and friends export a 64-bit format of transaction id that is extended with an “epoch” counter so that it will not wrap around during the life of an installation. The epoch value it uses is based on the epoch that is maintained by checkpoint (aka only checkpoint increments it).
>>>
>>> Now if epoch changes multiple times between two checkpoints (practically the chances of this are bleak, but there is a theoretical possibility), then won't the computation of xids go wrong? Basically, it can give the same value of txid after wraparound if the checkpoint doesn't occur between the two calls to txid_current.
>>
>> AFAICS, yes, if epoch changes multiple times between two checkpoints, then the computation will go wrong. And it doesn't look like a purely theoretical possibility to me, because I think I know of a couple of instances on the edge of this...

I think it's not terribly likely in practice, due to the required WAL size. You need at least a commit record for each of 4 billion transactions. Each commit record is at least 24 bytes long, and in a non-artificial scenario you would additionally have a few hundred bytes of actual content of WAL. So we're talking about a distance of at least 0.5-2TB within a single checkpoint here. Not impossible, but not likely either.

> Okay, it is quite strange that we haven't discovered this problem till now. I think we should do something to fix it. One idea is that we track the epoch change in shared memory (probably in the same data structure (VariableCacheData) where we track nextXid). We need to increment it when the xid wraps around during xid allocation (in GetNewTransactionId). Also, we need to make it persistent, which means we need to log it in the checkpoint xlog record and write a separate xlog record for the epoch change.

I think it makes a fair bit of sense to not do the current crufty tracking of xid epochs. I don't really know how we got there, but it doesn't make terribly much sense. Don't think we need additional WAL logging though - we should be able to piggyback this onto the already existing clog logging.

I kinda wonder if we shouldn't just track nextXid as a 64bit integer internally, instead of bothering with tracking the epoch separately. Then we can "just" truncate it in the cases where it's stored in space-constrained places etc.

Greetings, Andres Freund
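For illustration, the last idea could look roughly like the sketch below. The FullTransactionId typedef and the helper names are assumptions made up for this example; the point is only the shift-and-truncate scheme:

    typedef uint64 FullTransactionId;   /* assumed internal representation */

    /* low 32 bits are the xid stored on disk and on the wire */
    static inline TransactionId
    XidFromFullXid(FullTransactionId fxid)
    {
        return (TransactionId) fxid;
    }

    /* high 32 bits play the role of today's epoch */
    static inline uint32
    EpochFromFullXid(FullTransactionId fxid)
    {
        return (uint32) (fxid >> 32);
    }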
Re: Speeding up pg_upgrade
On 12/5/17 09:23, Stephen Frost wrote: > Right, but that doesn't really answer the question as to which part of > the pg_upgrade process is taking up the time. Yeah, that should be measured before we do anything. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Usage of epoch in txid_current
Andres,

* Andres Freund (and...@anarazel.de) wrote:
> On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:
>> On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov wrote:
>>> On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila wrote:
>>>> Currently, txid_current and friends export a 64-bit format of transaction id that is extended with an “epoch” counter so that it will not wrap around during the life of an installation. The epoch value it uses is based on the epoch that is maintained by checkpoint (aka only checkpoint increments it).
>>>>
>>>> Now if epoch changes multiple times between two checkpoints (practically the chances of this are bleak, but there is a theoretical possibility), then won't the computation of xids go wrong? Basically, it can give the same value of txid after wraparound if the checkpoint doesn't occur between the two calls to txid_current.
>>>
>>> AFAICS, yes, if epoch changes multiple times between two checkpoints, then the computation will go wrong. And it doesn't look like a purely theoretical possibility to me, because I think I know of a couple of instances on the edge of this...
>
> I think it's not terribly likely in practice, due to the required WAL size. You need at least a commit record for each of 4 billion transactions. Each commit record is at least 24 bytes long, and in a non-artificial scenario you would additionally have a few hundred bytes of actual content of WAL. So we're talking about a distance of at least 0.5-2TB within a single checkpoint here. Not impossible, but not likely either.

At the bottom end, with a 30-minute checkpoint, that's about 300MB/s. Certainly quite a bit and we might have trouble getting there for other reasons, but definitely something that can be accomplished with even a single SSD these days.

>> Okay, it is quite strange that we haven't discovered this problem till now. I think we should do something to fix it. One idea is that we track the epoch change in shared memory (probably in the same data structure (VariableCacheData) where we track nextXid). We need to increment it when the xid wraps around during xid allocation (in GetNewTransactionId). Also, we need to make it persistent, which means we need to log it in the checkpoint xlog record and write a separate xlog record for the epoch change.
>
> I think it makes a fair bit of sense to not do the current crufty tracking of xid epochs. I don't really know how we got there, but it doesn't make terribly much sense. Don't think we need additional WAL logging though - we should be able to piggyback this onto the already existing clog logging.

Don't you mean xact logging? ;)

> I kinda wonder if we shouldn't just track nextXid as a 64bit integer internally, instead of bothering with tracking the epoch separately. Then we can "just" truncate it in the cases where it's stored in space constrained places etc.

This sounds reasonable to me, at least, but I've not been in these depths much.

Thanks!

Stephen
Re: [HACKERS] parallel.c oblivion of worker-startup failures
On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier wrote:
> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila wrote:
>> The patch still applies (with some hunks). I have added it in CF [1] to avoid losing track.
>>
>> [1] - https://commitfest.postgresql.org/15/1341/
>
> This did not get reviews and the patch still applies. I am moving it to next CF.

I'd like to break this into two patches, one to fix the general background worker problems and another to fix the problems specific to parallel query. The attached version extracts the two fixes for general background worker problems from Amit's patch and adjusts the comments a bit more extensively than he did. Barring objections, I'll commit and back-patch this; then we can deal with the other part of this afterward.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

bgworker-startup.patch
Re: Usage of epoch in txid_current
On December 5, 2017 10:01:43 AM PST, Stephen Frost wrote:
> Andres,
>
> * Andres Freund (and...@anarazel.de) wrote:
>> I think it makes a fair bit of sense to not do the current crufty tracking of xid epochs. I don't really know how we got there, but it doesn't make terribly much sense. Don't think we need additional WAL logging though - we should be able to piggyback this onto the already existing clog logging.
>
> Don't you mean xact logging? ;)

No. We log a WAL record at clog boundaries. Wraparounds have to be at one. We could just include the 64 bit xid there and would have reliable tracking.

Andres
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [HACKERS] Transaction control in procedures
On 12/1/17 15:28, Robert Haas wrote:
> This feature doesn't have many tests. I think it should have a lot more of them. It's tinkering with the transaction control machinery of the system in a fairly fundamental way, and that could break things.

Thank you, these are great ideas.

> I suggest, in particular, testing how it interacts with resources such as cursors and prepared statements. For example, what happens if you commit or roll back inside a cursor-for loop (given that the cursor is not holdable)?

This was discussed briefly earlier in the thread. The mid-term fix is to convert pinned cursors to holdable ones before a COMMIT in PL/pgSQL and then clean them up separately later. I have that mostly working, but I'd like to hold it for a separate patch submission. The short-term fix is to prohibit COMMIT and ROLLBACK while a portal is pinned.

I think ROLLBACK in a cursor loop might not make sense, because the cursor query itself could have side effects, so a rollback would have to roll back the entire loop. That might need more refined analysis before it could be allowed.

> - COMMIT or ROLLBACK inside a PLpgsql block with an attached EXCEPTION block, or when an SQL SAVEPOINT has been established previously.

I think that needs to be prohibited because if you end transactions in an exception-handled block, you can no longer actually roll back that block when an exception occurs, which was the entire point.

> - COMMIT or ROLLBACK inside a procedure with a SET clause attached,

That also needs to be prohibited because of the way the GUC nesting currently works. It's probably possible to fix it, but it would be a separate effort.

> and/or while SET LOCAL is in effect either at the inner or outer level.

That seems to work fine.

> - COMMIT or ROLLBACK with open large objects.

I haven't been able to reproduce any problems with that, but maybe I haven't tried hard enough.

> - COMMIT inside a procedure fails because of a serialization failure, deferred constraint, etc.

That works fine. The COMMIT fails and control exits the procedure using the normal exception propagation.

I'll submit an updated patch with some fixes for the above and more documentation.

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Transaction control in procedures
On Tue, Dec 5, 2017 at 1:25 PM, Peter Eisentraut wrote: > I think ROLLBACK in a cursor loop might not make sense, because the > cursor query itself could have side effects, so a rollback would have to > roll back the entire loop. That might need more refined analysis before > it could be allowed. COMMIT really has the same problem; if the cursor query has side effects, you can't commit those side effects piecemeal as the loop executes and have things behave sanely. >> - COMMIT or ROLLBACK inside a procedure with a SET clause attached, > > That also needs to be prohibited because of the way the GUC nesting > currently works. It's probably possible to fix it, but it would be a > separate effort. > >> and/or while SET LOCAL is in effect either at the inner or outer >> level. > > That seems to work fine. These two are related -- if you don't permit anything that makes temporary changes to GUCs at all, like SET clauses attached to functions, then SET LOCAL won't cause any problems. The problem is if you do a transaction operation when something set locally is in the stack of values, but not at the top. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
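A sketch of the GUC-stack case Robert describes, again assuming the patch's CREATE PROCEDURE / CALL syntax (names are hypothetical, and how this should behave is exactly what is being worked out here); the transaction operation fires while the outer locally-set value is in the stack of values but not at the top:

    CREATE PROCEDURE guc_stack_case()
    LANGUAGE plpgsql
    AS $$
    BEGIN
        SET LOCAL work_mem = '8MB';  -- inner SET LOCAL, now at the top of the stack
        COMMIT;                      -- transaction operation with both levels in effect
    END;
    $$;

    BEGIN;
    SET LOCAL work_mem = '4MB';      -- outer SET LOCAL, below the top of the stack
    CALL guc_stack_case();
    COMMIT;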
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 5, 2017 at 2:49 AM, Thomas Munro wrote: > As for how to aggregate the information, isn't it reasonable to show > data from the last loop on the basis that it's representative? > Summing wouldn't make too much sense, because you didn't use that much > memory all at once. Sorts can be rescanned even without parallel query, so I guess we should try to make the parallel case kinda like the non-parallel case. If I'm not wrong, that will just use the stats from the most recent execution (i.e. the last loop) -- see show_sort_info() in explain.c. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Usage of epoch in txid_current
Stephen Frost writes: > * Andres Freund (and...@anarazel.de) wrote: >> I kinda wonder if we shouldn't just track nextXid as a 64bit integer >> internally, instead of bothering with tracking the epoch >> separately. Then we can "just" truncate it in the cases where it's >> stored in space constrained places etc. > This sounds reasonable to me, at least, but I've not been in these > depths much. +1 ... I think the reason it's like that is simply that nobody's revisited the XID generator since we decided to require 64-bit integer support. We'd need this for support of true 64-bit XIDs, too, though I'm unsure whether that project is going anywhere anytime soon. In any case it seems like a separable subset of that work. regards, tom lane
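For context, the epoch/xid split being discussed is already visible at the SQL level, since txid_current() returns the epoch-extended 64-bit value:

    WITH x AS (SELECT txid_current() AS full_txid)
    SELECT full_txid,
           full_txid >> 32                      AS epoch,
           full_txid & ((1::bigint << 32) - 1)  AS xid
    FROM x;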
Re: dsa_allocate could not find 4 free pages
On Wed, Dec 6, 2017 at 6:17 AM, Mark Dilger wrote: > I'm not asking for help configuring my system. If this is a run of the mill > out of memory > type error, I'll deal with it myself. But if this indicates a bug, I'll be > happy to try to distill > what I'm doing to a test case that I can share. (I've only hit this once, > and have little > idea whether it would be reproducible.) The gist of the query was an > array_agg over > the results of a parallel query over a partitioned table, executed from > within a plpgsql > function, like: > > SELECT ARRAY_AGG((a, latest, timelogged, b, cnt)::custom_composite_type) AS > trans FROM > (SELECT a, b, timelogged, MAX(timelogged) OVER (PARTITION BY a) AS > latest, SUM(cnt) AS cnt FROM > (SELECT a, b, timelogged, COUNT(*) AS cnt > FROM mytable > WHERE a= > ANY('{65537,65538,65539,65540,65541,65542,65543,65544}'::OID[]) > AND timelogged >= '2017-12-04 16:12:50-08'::TIMESTAMP > AND timelogged < '2017-12-04 17:12:50-08'::TIMESTAMP > GROUP BY a, b, timelogged > ) AS ss > GROUP BY a, b, timelogged > ) AS s Hi Mark, Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan? https://www.postgresql.org/message-id/flat/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 5, 2017 at 12:39 AM, Amit Kapila wrote: > Yeah, that sounds better, so modified the patch accordingly. I committed this to master and REL_10_STABLE, but it conflicts all over the place on 9.6. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: es_query_dsa is broken
On Tue, Dec 5, 2017 at 7:24 AM, Amit Kapila wrote:
> +     EState     *estate = gatherstate->ps.state;
> +
> +     /* Install our DSA area while executing the plan. */
> +     estate->es_query_dsa = gatherstate->pei->area;
>       outerTupleSlot = ExecProcNode(outerPlan);
> +     estate->es_query_dsa = NULL;
>
> Won't the above coding pattern create a problem, if ExecProcNode throws an error and outer block catches it and continues execution (consider the case of execution inside PL blocks)?

I don't see what the problem is. The query that got aborted by the error wouldn't be sharing an EState with one that didn't. Control would have to return to someplace inside the same ExecProcNode and it would have to return normally. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pow support for pgbench
On Tue, Dec 5, 2017 at 7:44 AM, Raúl Marín Rodríguez wrote: > I've been giving a thought about this and I think we could reach the > compromise > of having a single function with 2 overloads: > * pow(double, double) -> double: Uses C pow(). > * pow(int, int) -> double: Uses ipow() for positive exponents, and pow() > for negative exponents. > > In both cases we'd return a double but we use the fast ipow if it's possible > (which can be 20x faster), so at the cost of an extra cast if you need an > int, > we'd have a consistent API. Would this be acceptable? It seems OK to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Walsender timeouts and large transactions
On Mon, Dec 4, 2017 at 10:59 PM, Craig Ringer wrote:
> To me it looks like it's time to get this committed, marking as such.

This version has noticeably more code rearrangement than before, and I'm not sure that is actually buying us anything. Why not keep the changes minimal?

Also, TBH, this doesn't seem to have been carefully reviewed for style:

-    if (!pq_is_send_pending())
-        return;
+    /* Try taking fast path unless we get too close to walsender timeout. */
+    if (now < TimestampTzPlusMilliseconds(last_reply_timestamp,
+                                          wal_sender_timeout / 2))
+    {
+        if (!pq_is_send_pending())
+            return;
+    }

Generally we write if (a && b) { ... }, not if (a) { if (b) ... }. And here a stray semicolon gets added after a closing brace:

-}
+};

It's hard to understand how this got through review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: dsa_allocate could not find 4 free pages
> On Dec 5, 2017, at 11:25 AM, Thomas Munro wrote:
>
> On Wed, Dec 6, 2017 at 6:17 AM, Mark Dilger wrote:
>> I'm not asking for help configuring my system. If this is a run of the mill out of memory
>> type error, I'll deal with it myself. But if this indicates a bug, I'll be happy to try to distill
>> what I'm doing to a test case that I can share. (I've only hit this once, and have little
>> idea whether it would be reproducible.) The gist of the query was an array_agg over
>> the results of a parallel query over a partitioned table, executed from within a plpgsql
>> function, like:
>>
>> SELECT ARRAY_AGG((a, latest, timelogged, b, cnt)::custom_composite_type) AS trans FROM
>>     (SELECT a, b, timelogged, MAX(timelogged) OVER (PARTITION BY a) AS latest, SUM(cnt) AS cnt FROM
>>         (SELECT a, b, timelogged, COUNT(*) AS cnt
>>          FROM mytable
>>          WHERE a = ANY('{65537,65538,65539,65540,65541,65542,65543,65544}'::OID[])
>>            AND timelogged >= '2017-12-04 16:12:50-08'::TIMESTAMP
>>            AND timelogged < '2017-12-04 17:12:50-08'::TIMESTAMP
>>          GROUP BY a, b, timelogged
>>         ) AS ss
>>      GROUP BY a, b, timelogged
>>     ) AS s
>
> Hi Mark,
>
> Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan?

This was encountered and logged by a java client. The only data I got was:

    org.postgresql.util.PSQLException: ERROR: dsa_allocate could not find 4 free pages
      Where: parallel worker

I know what the query was, because it is the only query this client issues, but there is no record of what the plan might have been. The java client runs queries spanning a long time period followed by multiple queries spanning short time periods. The sql query is the same in both cases; only the bind parameters for the time frame differ. The short time period queries do not generate parallel plans (in my experience), but the long time period ones do, with plans that look something like this:

 Aggregate  (cost=31552146.47..31552146.48 rows=1 width=32)
   -> WindowAgg  (cost=14319575.96..31538146.47 rows=80 width=43)
      -> GroupAggregate  (cost=14319575.96..31526146.47 rows=80 width=39)
           Group Key: acct_r.mta, acct_r.dtype, (moment(acct_r.timelogged))
         -> Finalize GroupAggregate  (cost=14319575.96..31356146.47 rows=800 width=15)
              Group Key: acct_r.mta, acct_r.dtype, (moment(acct_r.timelogged))
            -> Gather Merge  (cost=14319575.96..30056146.47 rows=12000 width=15)
                 Workers Planned: 15
               -> Partial GroupAggregate  (cost=14318575.64..14755146.15 rows=800 width=15)
                    Group Key: acct_r.mta, acct_r.dtype, (moment(acct_r.timelogged))
                  -> Sort  (cost=14318575.64..14385889.74 rows=26925641 width=7)
                       Sort Key: acct_r.mta, acct_r.dtype, (moment(acct_r.timelogged))
                     -> Result  (cost=0.00..10627491.52 rows=26925641 width=7)
                        -> Append  (cost=0.00..10290921.01 rows=26925641 width=7)
                           -> Parallel Seq Scan on acct_r  (cost=0.00..784882.50 rows=1805625 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_d  (cost=0.00..542032.80 rows=1757640 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_b  (cost=0.00..111584.91 rows=206096 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_x  (cost=0.00..20661.29 rows=48165 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_r acct_r_1  (cost=0.00..795917.43 rows=1831322 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_d acct_d_1  (cost=0.00..549630.82 rows=1782141 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_b acct_b_1
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 6 December 2017 at 04:29, Robert Haas wrote:
> On Mon, Dec 4, 2017 at 6:57 PM, David Rowley
>> I proposed that this worked a different way in [1]. I think the way I mention is cleaner as it means there's no extra reason for a partitioned index to be indisvalid=false than there is for any other normal index.
>
> How does that proposal keep pg_dump from latching onto the wrong index, if there's more than one index on the same columns?

I'm not hugely concerned about that. It's not a new problem and it's not a problem that I recall seeing anyone complain about, at least not to the extent that we've ever bothered to fix it. The existing problem is with FOREIGN KEY constraints just choosing the first matching index in transformFkeyCheckAttrs(). We can see the issue today with:

    create table t1 (id int not null);
    create unique index t1_idx_b on t1 (id);
    create table t2 (id int references t1 (id));
    create unique index t1_idx_a on t1 (id);

    -- pg_dump this database and restore it, then in the restored copy:
    # drop index t1_idx_a;
    ERROR: cannot drop index t1_idx_a because other objects depend on it
    DETAIL: constraint t2_id_fkey on table t2 depends on index t1_idx_a
    HINT: Use DROP ... CASCADE to drop the dependent objects too.

>> My primary reason for not liking this way is around leaving indexes around as indisvalid=false, however, I suppose that an index could be replaced atomically with a DETACH and ATTACH within a single transaction. I had previously not really liked the idea of invalidating an index by DETACHing a leaf table's index from it. Of course, this patch does nothing special with partitioned indexes, but I believe one day we will want to do something with these and that the DETACH/ATTACH is not the best design for replacing part of the index.
>
> What do you propose?

For this patch, I propose we do nothing about that, but for the future, I already mentioned an idea to do this above:

> I'd have thought some sort of:
> ALTER INDEX name_of_partitioned_index REPLACE INDEX FOR partitioned_tablename WITH name_of_new_matching_bloat_free_index;

This way there is no period where we have to mark the index as indisvalid=false. So, for when we invent something that uses this feature, there's no window of time that concurrently running queries are in danger of generating a poor plan. The index swap operation is atomic. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
David Rowley wrote: > On 6 December 2017 at 04:29, Robert Haas wrote: > > On Mon, Dec 4, 2017 at 6:57 PM, David Rowley > >> I proposed that this worked a different way in [1]. I think the way I > >> mention is cleaner as it means there's no extra reason for a > >> partitioned index to be indisvalid=false than there is for any other > >> normal index. > > > > How does that proposal keep pg_dump from latching onto the wrong > > index, if there's more than one index on the same columns? > > I'm not hugely concerned about that. It's not a new problem and it's > not a problem that I recall seeing anyone complain about, at least not > to the extent that we've ever bothered to fix it. Another reason to do things the proposed way is to let parallel restore create the indexes in parallel. If we just have a single CREATE INDEX that cascades to the partitions, that can be run by a single backend only, which is a loser. (There's also the fact that you'd lose any COMMENTs etc). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: compress method for spgist - 2
On Tue, Dec 5, 2017 at 1:14 PM, Darafei Praliaskouski wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: not tested
> Implements feature: not tested
> Spec compliant: not tested
> Documentation: tested, passed
>
> I've read the updated patch and see my concerns addressed.
>
> I'm looking forward to SP-GiST compress method support, as it will allow usage of SP-GiST index infrastructure for PostGIS.
>
> The new status of this patch is: Ready for Committer

I went through this patch. I found the documentation changes to be insufficient, and I've made some improvements. In particular, I didn't understand why reconstructedValue is claimed to be of type spgConfigOut.leafType when, from both the general logic and the code, it should be of type spgConfigIn.attType. I've fixed that. Nikita, correct me if I'm wrong. Also, I wonder whether we should check for the existence of a compress method in the spgvalidate() function when attType and leafType are not the same? We don't do this for now. 0002-spgist-polygon-8.patch is OK for me so far. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company 0001-spgist-compress-method-9.patch Description: Binary data
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Wed, Dec 6, 2017 at 12:02 AM, Amit Kapila wrote: > On Tue, Dec 5, 2017 at 1:29 PM, Thomas Munro > wrote: >> Or would it be an option to change the time >> ExecXXXRetrieveInstrumentation() is called so it is run only once? > > To me, that doesn't sound like a bad option. I think if do so, then > we don't even need to reinitialize the shared sort stats. I think > something, like attached, should work if we want to go this route. We > can add regression test if this is what we think is a good idea. > Having said that, one problem I see doing thing this way is that in > general, we will display the accumulated stats of each worker, but for > sort or some other special nodes (like hash), we will show the > information of only last loop. I am not sure if that is a matter of > concern, but if we want to do this way, then probably we should > explain this in documentation as well. The hash version of this code is now committed as 5bcf389e. Here is a patch for discussion that adds some extra tests to join.sql to exercise rescans of a hash join under a Gather node. It fails on head, because it loses track of the instrumentation pointer after the first loop as you described (since the Hash coding is the same as the Sort coding), so it finishes up with no instrumentation data. If you move ExecParallelRetrieveInstrumentation() to ExecParallelCleanup() as you showed in your patch, then it passes. The way I'm asserting that instrumentation data is making its way back to the leader is by turning off leader participation and then checking if it knows how many batches there were. -- Thomas Munro http://www.enterprisedb.com test-hash-join-rescan-instr-v1.patch Description: Binary data
Re: [HACKERS] Parallel Append implementation
On Tue, Nov 28, 2017 at 6:02 AM, amul sul wrote:
> Here are the changes I did on v21 patch to handle crash reported by Rajkumar[1]:
>
> diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
> index e3b17cf0e2..e0ee918808 100644
> --- a/src/backend/executor/nodeAppend.c
> +++ b/src/backend/executor/nodeAppend.c
> @@ -479,9 +479,12 @@ choose_next_subplan_for_worker(AppendState *node)
>          pstate->pa_next_plan = append->first_partial_plan;
>      else
>          pstate->pa_next_plan++;
> -    if (pstate->pa_next_plan == node->as_whichplan)
> +
> +    if (pstate->pa_next_plan == node->as_whichplan ||
> +        (pstate->pa_next_plan == append->first_partial_plan &&
> +         append->first_partial_plan >= node->as_nplans))
>      {
> -        /* We've tried everything! */
> +        /* We've tried everything or there were no partial plans */
>          pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
>          LWLockRelease(&pstate->pa_lock);
>          return false;

I changed this around a little, added a test case, and committed this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Tue, Dec 5, 2017 at 3:53 PM, David Rowley wrote: > I'm not hugely concerned about that. It's not a new problem and it's > not a problem that I recall seeing anyone complain about, at least not > to the extent that we've ever bothered to fix it. > > The existing problem is with FOREIGN KEY constraints just choosing the > first matching index in transformFkeyCheckAttrs() > > We can see the issue today with: > > create table t1 (id int not null); > create unique index t1_idx_b on t1 (id); > create table t2 (id int references t1 (id)); > create unique index t1_idx_a on t1 (id); > > > > > # drop index t1_idx_a; > ERROR: cannot drop index t1_idx_a because other objects depend on it > DETAIL: constraint t2_id_fkey on table t2 depends on index t1_idx_a > HINT: Use DROP ... CASCADE to drop the dependent objects too. Ugh, that sucks. I don't think it's a good argument for making the problem worse, though. What are we giving up by explicitly attaching the correct index? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 6 December 2017 at 09:58, Alvaro Herrera wrote: > David Rowley wrote: >> On 6 December 2017 at 04:29, Robert Haas wrote: >> > How does that proposal keep pg_dump from latching onto the wrong >> > index, if there's more than one index on the same columns? >> >> I'm not hugely concerned about that. It's not a new problem and it's >> not a problem that I recall seeing anyone complain about, at least not >> to the extent that we've ever bothered to fix it. > > Another reason to do things the proposed way is to let parallel restore > create the indexes in parallel. If we just have a single CREATE INDEX > that cascades to the partitions, that can be run by a single backend > only, which is a loser. I'm not all that sure why parallel restore could not work with this. The leaf partition indexes would all be created first, then the CREATE INDEX for the partitioned index would just be a metadata change with some checks to ensure all the leaf partition indexes exist. > (There's also the fact that you'd lose any COMMENTs etc). I can't see why that would be a problem with my proposal. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Add RANGE with values and exclusions clauses to the Window Functions
On 2017-11-27 17:55, Oliver Ford wrote: > Cheers here's v4 with the correct docs. This email is just to link the related thread "Add GROUPS option to the Window Functions": https://www.postgresql.org/message-id/CAGMVOdtWkb9X7dUh7vjaCaiH34UGFg88unXYTEOub0Rk0swSXw%40mail.gmail.com
Re: dsa_allocate could not find 4 free pages
On Wed, Dec 6, 2017 at 9:35 AM, Mark Dilger wrote: >> On Dec 5, 2017, at 11:25 AM, Thomas Munro >> wrote: >> Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan? > > This was encountered and logged by a java client. The only data I got was: > > org.postgresql.util.PSQLException: ERROR: dsa_allocate could not find 4 free > pages > Where: parallel worker This means that the DSA area is corrupted. Presumably get_best_segment(area, 4) returned a segment that wasn't actually good for 4 pages, either because it was incorrectly binned or because its free space btree was corrupted. Another path would be that make_new_segment(area, 4) returned a segment that couldn't find 4 pages, but that seems unlikely. > [query plan with one Gather and no Parallel Bitmap Heap Scan] I'm not sure why this plan would ever call dsa_allocate(). > [query plan with no Gather but plenty of Bitmap Heap Scans] And this one certainly can't. I guess you must sometimes get a different variation that has Gather nodes and uses Parallel Bitmap Heap Scan. Then the question is whether the es_query_dsa multiple Gather bug can explain this: for example, if dsa_free(wrong_dsa_area, p) was called, perhaps it could produce this type of corruption. Otherwise we have a different bug. Any clues on how to reproduce the problem would be very welcome. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Walsender timeouts and large transactions
On 6 December 2017 at 04:07, Robert Haas wrote: > On Mon, Dec 4, 2017 at 10:59 PM, Craig Ringer > wrote: > > To me it looks like it's time to get this committed, marking as such. > > This version has noticeably more code rearrangement than before, and > I'm not sure that is actually buying us anything. Why not keep the > changes minimal? > > Also, TBH, this doesn't seem to have been carefully reviewed for style: > > -if (!pq_is_send_pending()) > -return; > +/* Try taking fast path unless we get too close to walsender timeout. > */ > +if (now < TimestampTzPlusMilliseconds(last_reply_timestamp, > + wal_sender_timeout / 2)) > +{ > +if (!pq_is_send_pending()) > +return; > +} > > Generally we write if (a && b) { ... } not if (a) { if (b) .. } > > -} > +}; > > It's hard to understand how this got through review. > Entirely my fault - I tend to forget to look at code style when I'm focused on functionality. My apologies, and thanks for the reminder. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 6 December 2017 at 11:35, Robert Haas wrote: > What are we giving up by explicitly attaching > the correct index? The part I don't like about the ATTACH and DETACH of partitioned index is that it seems to be trying to just follow the syntax we use to remove a partition from a partitioned table, however, there's a huge difference between the two, as DETACHing a partition from a partitioned table leaves the partitioned table in a valid state, it simply just no longer contains the detached partition. With the partitioned index, we leave the index in an invalid state after a DETACH. It can only be made valid again once another leaf index has been ATTACHED again and that we've verified that all other indexes on every leaf partition is also there and are valid. If we're going to use these indexes to answer queries, then it seems like we should try to keep them valid so that queries can actually use them for something. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: dsa_allocate could not find 4 free pages
> On Dec 5, 2017, at 4:07 PM, Thomas Munro > wrote: > > On Wed, Dec 6, 2017 at 9:35 AM, Mark Dilger wrote: >>> On Dec 5, 2017, at 11:25 AM, Thomas Munro >>> wrote: >>> Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan? >> >> This was encountered and logged by a java client. The only data I got was: >> >> org.postgresql.util.PSQLException: ERROR: dsa_allocate could not find 4 free >> pages >> Where: parallel worker > > This means that the DSA area is corrupted. Presumably > get_best_segment(area, 4) returned a segment that wasn't actually good > for 4 pages, either because it was incorrectly binned or because its > free space btree was corrupted. Another path would be that > make_new_segment(area, 4) returned a segment that couldn't find 4 > pages, but that seems unlikely. > >> [query plan with one Gather and no Parallel Bitmap Heap Scan] > > I'm not sure why this plan would ever call dsa_allocate(). > >> [query plan with no Gather but plenty of Btimap Heap Scans] > > And this one certainly can't. I guess you must sometimes get a > different variation that has Gather nodes and uses Parallel Bitmap > Heap Scan. Yes, I can believe that the plan is sometimes different. This error has occurred several times now, but it is still rather infrequent, so either the plan that triggers it is rare, or the bug is intermittent even with the same plan being chosen, or perhaps both. > Then the question is whether the es_query_dsa multiple > Gather bug can explain this: for example, if dsa_free(wrong_dsa_area, > p) was called, perhaps it could produce this type of corruption. > Otherwise we have a different bug. Any clues on how to reproduce the > problem would be very welcome. I have written (and rewritten, and rewritten) a tap test in the hopes of getting a test case that reproduces this reliably (or even once), but without luck so far. I will keep trying. mark
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Wed, Dec 6, 2017 at 9:42 AM, David Rowley wrote: > On 6 December 2017 at 11:35, Robert Haas wrote: >> What are we giving up by explicitly attaching >> the correct index? > > The part I don't like about the ATTACH and DETACH of partitioned index > is that it seems to be trying to just follow the syntax we use to > remove a partition from a partitioned table, however, there's a huge > difference between the two, as DETACHing a partition from a > partitioned table leaves the partitioned table in a valid state, it > simply just no longer contains the detached partition. With the > partitioned index, we leave the index in an invalid state after a > DETACH. It can only be made valid again once another leaf index has > been ATTACHED again and that we've verified that all other indexes on > every leaf partition is also there and are valid. If we're going to > use these indexes to answer queries, then it seems like we should try > to keep them valid so that queries can actually use them for > something. > +1 for all that. Exactly my thoughts. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 6 December 2017 at 13:42, David Rowley wrote: > On 6 December 2017 at 11:35, Robert Haas wrote: >> What are we giving up by explicitly attaching >> the correct index? > > The part I don't like about the ATTACH and DETACH of partitioned index > is that it seems to be trying to just follow the syntax we use to > remove a partition from a partitioned table, however, there's a huge > difference between the two, as DETACHing a partition from a > partitioned table leaves the partitioned table in a valid state, it > simply just no longer contains the detached partition. With the > partitioned index, we leave the index in an invalid state after a > DETACH. It can only be made valid again once another leaf index has > been ATTACHED again and that we've verified that all other indexes on > every leaf partition is also there and are valid. If we're going to > use these indexes to answer queries, then it seems like we should try > to keep them valid so that queries can actually use them for > something. Also, ATTACH and DETACH are especially useless when it comes to UNIQUE indexes. If we simply want to replace out a bloated index using a DETACH quickly followed by an ATTACH then it leaves a non-zero window of time that we can't be certain that the uniqueness is enforced. This would still work on an individual partition level, but if we ever want to reference a UNIQUE partitioned index in a foreign key constraint then what happens to the foreign key when the index is in the invalid state? Should we just disallow DETACH when a foreign key exists? or just invalidate the foreign key constraint too? Both seem like a nightmare from a DBA point-of-view. You might argue that concurrently recreating an index used by a foreign key is just as difficult today, but there's no reason to make this as problematic, is there? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pow support for pgbench
>> In both cases we'd return a double but we use the fast ipow if it's possible (which can be 20x faster), so at the cost of an extra cast if you need an int, we'd have a consistent API. Would this be acceptable?
>
> It seems OK to me.

Computing as an int, casting to double and back to int8 can generate a loss of precision. However for powers of 2 it works exactly, so e.g. computing a mask would be ok. This proposal does not exactly match SQL behavior, but I do not see this as a problem, which is why I was happy with the previous proposal. -- Fabien.
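Fabien's precision point is easy to see with ordinary server-side SQL (not pgbench itself): an exact result needing more than a double's 53 mantissa bits loses its low bits on the way back to int8, while a power of two survives the round trip:

    -- 7^22 = 3909821048582988049 exactly, but that needs 62 significant bits,
    -- so going through double precision and back to bigint loses the low bits:
    SELECT (7::double precision ^ 22)::bigint AS pow_via_double;

    -- 2^62 is exactly representable as a double, so the round trip is exact:
    SELECT (2::double precision ^ 62)::bigint AS pow_via_double;  -- 4611686018427387904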
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Wed, Dec 6, 2017 at 12:18 AM, Robert Haas wrote: > On Tue, Dec 5, 2017 at 2:49 AM, Thomas Munro > wrote: >> As for how to aggregate the information, isn't it reasonable to show >> data from the last loop on the basis that it's representative? >> Summing wouldn't make too much sense, because you didn't use that much >> memory all at once. > > Sorts can be rescanned even without parallel query, so I guess we > should try to make the parallel case kinda like the non-parallel case. > If I'm not wrong, that will just use the stats from the most recent > execution (i.e. the last loop) -- see show_sort_info() in explain.c. > Right and seeing that I have prepared the patch (posted above [1]) which fixes it such that it will resemble the non-parallel case. Ideally, it would have obviated the need for my previous patch which got committed as 778e78ae. However, now that is committed, I could think of below options: 1. I shall rebase it atop what is committed and actually, I have done that in the attached patch. I have also prepared a regression test case patch just to show the output with and without the patch. 2. For sort node, we can fix it by having some local_info same as shared_info in sort node and copy the shared_info in that or we could reinstate the pointer to the DSM in ExecSortReInitializeDSM() by looking it up in the TOC as suggested by Thomas. If we go this way, then we need a similar fix for hash node as well. [1] - https://www.postgresql.org/message-id/CAA4eK1JBj4YCEQKeTua5%3DBMXy7zW7zNOvoXomzBP%3Dkb_aqMF7w%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com fix_accum_instr_parallel_workers_v5.patch Description: Binary data test_sort_stats_v1.1.patch Description: Binary data
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Wed, Dec 6, 2017 at 1:10 AM, Robert Haas wrote: > On Tue, Dec 5, 2017 at 12:39 AM, Amit Kapila wrote: >> Yeah, that sounds better, so modified the patch accordingly. > > I committed this to master and REL_10_STABLE, but it conflicts all > over the place on 9.6. > I will try to prepare the patch for 9.6, but I think it might be better if we first decide what to do about the open issue for sort and hash node as there can be some overlap based on what approach we choose to fix it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Error bundling script file with extension when event trigger commands are used
Hi, I am trying to bundle a script file with my extension. The script has event triggers, and the trigger function does a select query on pg_event_trigger_ddl_commands. While executing "create extension", the following error is thrown:

    ERROR: pg_event_trigger_ddl_commands() can only be called in an event trigger function
    CONTEXT: SQL statement "SELECT * FROM pg_event_trigger_ddl_commands() WHERE object_type = 'table'

I found that there is already a bug report related to this: https://www.postgresql.org/message-id/flat/20170913075559.25630.41587%40wrigleys.postgresql.org#20170913075559.25630.41...@wrigleys.postgresql.org

I couldn't understand much from the above thread, but I would like to solve this problem. Thanks, Sanyam Jain
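For anyone trying to reproduce this, a cut-down sketch of the kind of thing such an extension install script contains (all object names here are made up):

    CREATE FUNCTION my_ext_ddl_hook() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        r record;
    BEGIN
        -- pg_event_trigger_ddl_commands() is only callable from a
        -- ddl_command_end event trigger function.
        FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
                 WHERE object_type = 'table'
        LOOP
            RAISE NOTICE 'DDL on %', r.object_identity;
        END LOOP;
    END;
    $$;

    CREATE EVENT TRIGGER my_ext_on_ddl
        ON ddl_command_end
        EXECUTE PROCEDURE my_ext_ddl_hook();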
Re: [HACKERS] parallel.c oblivion of worker-startup failures
On Tue, Dec 5, 2017 at 11:34 PM, Robert Haas wrote: > On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier > wrote: >> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila wrote: >>> The patch still applies (with some hunks). I have added it in CF [1] >>> to avoid losing track. >>> >>> [1] - https://commitfest.postgresql.org/15/1341/ >> >> This did not get reviews and the patch still applies. I am moving it to next >> CF. > > I'd like to break this into two patches, one to fix the general > background worker problems and another to fix the problems specific to > parallel query. > > The attached version extracts the two fixes for general background > worker problems from Amit's patch and adjusts the comments a bit more > extensively than he did. > Looks good to me. > Barring objections, I'll commit and > back-patch this; then we can deal with the other part of this > afterward. > Sure, I will rebase on top of the commit unless you have some comments on the remaining part. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Error bundling script file with extension when event trigger commands are used
On Wed, Dec 6, 2017 at 3:36 PM, sanyam jain wrote: > I couldn't understand much from the above thread, but I would like to solve this problem. Not sure that people are much motivated in solving this problem either, as the use-case is very narrow. Based on my memories of the moment where I looked at this problem, I was thinking that you would need additional catalog functions able to switch a session-level flag which completely disables event triggers during the execution of those commands... So it would mean that such a fix cannot be back-patched. If you would like to design or implement a patch, of course feel free! One hacky way to solve this problem would be to create the event triggers outside of the extension creation, meaning that the object dependencies would need to be tracked by yourself. -- Michael
Re: [HACKERS] Runtime Partition Pruning
On 2 December 2017 at 08:04, Robert Haas wrote:
> On Fri, Dec 1, 2017 at 6:20 AM, Beena Emerson wrote:
>> David Q1:
>> postgres=# explain analyse execute ab_q1 (3,3); --const
>>                                        QUERY PLAN
>> ------------------------------------------------------------------------------------------
>>  Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006 rows=0 loops=1)
>>    ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
>>          Filter: ((a = 3) AND (b = 3))
>>  Planning time: 0.588 ms
>>  Execution time: 0.043 ms
>> (5 rows)
>
> I think the EXPLAIN ANALYZE output should show something attached to the Append node so that we can tell that partition pruning is in use. I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes" or if we can give a few more useful details.

It already does. Any subnode with "(never executed)" was pruned at runtime. Do we really need anything else to tell us that? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
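To make that concrete with Beena's prepared statement from the quoted text (a sketch; by default the plancache only switches to a generic plan after the first five executions, which is when run-time pruning comes into play):

    -- once a generic plan is in use:
    EXPLAIN ANALYZE EXECUTE ab_q1 (3, 3);
    -- partitions pruned at execution time keep their subnodes in the output,
    -- but those subnodes are reported as "(never executed)"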