Re: Fix for pg_stat_activity putting client hostaddr into appname field

2018-03-29 Thread Michael Paquier
On Tue, Mar 27, 2018 at 03:47:07PM +1300, Edmund Horner wrote:
> But the stats array includes auxiliary processes, which means it has
> NumBackendStatSlots items.  The pointers for the aux process strings
> are out of range.  In the case of my query, the pointers for
> st_appname in the aux processes happen to point into the
> BackendClientHostnameBuffer segment.
> 
> This patch uses NumBackendStatSlots for the sizes of those segments,
> rather than MaxBackends.

I am adding in CC Robert and Kuntal who worked on that (issue added as
well to the older bug section in v11 open item list).

I have read through your patch and it seems to me that you are right
here.  The error comes from the original commit fc70a4b0, which added
auxiliary processes to pg_stat_activity.  BackendStatusShmemSize already
uses NumBackendStatSlots for the calculation of the arrays' lengths, but
CreateSharedBackendStatus still mixes it with MaxBackends.
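
For reference, the symptom can be spotted with a simple query against the
view; on an affected build, auxiliary processes may show another session's
client address in application_name (just an illustrative sketch, not output
from the patched code):

SELECT pid, backend_type, application_name, client_hostname
  FROM pg_stat_activity
 WHERE backend_type <> 'client backend';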

> I considered whether aux processes really need those strings
> (especially st_clienthostname), but decided it was more consistent
> just to assume that they might.  (It's an extra 3 kB... if we want to
> save that we can put a bunch of if statements in pgstat_bestart and
> other places.)

BackendStatusShmemSize was originally changed to use NumBackendStatSlots,
so the intention is visibly to be consistent in the number of backend slots
used, even if this means consuming a bit more memory; the idea was to focus
on consistency and simplicity.

> The patch also allocates local memory for st_clienthostname in
> pgstat_read_current_status.  These strings shouldn't change very
> often, but it seems safer and more consistent to treat them as we
> treat the other two string fields.  Without the string copy, I think a
> client could disconnect and be replaced after the stats have been copied,
> resulting in the new hostname appearing alongside the copied stats
> fields from the old connection.

Agreed on that as well.
--
Michael


Re: Commit fest 2017-11

2018-03-29 Thread Alexander Korotkov
Hi, Fabien!

On Fri, Dec 1, 2017 at 9:10 AM, Fabien COELHO  wrote:

> And the last 21 patches have been classified as well. Here is the
>> final score for this time:
>> Committed: 55.
>> Moved to next CF: 103.
>> Rejected: 1.
>> Returned with Feedback: 47.
>> Total: 206.
>>
>> Thanks to all the contributors for this session! The CF is now closed.
>>
>
> Thanks for the CF management.
>
> Attached a small graph of the end status of patch at the end of each CF.
>

Thank you for the graph!
It would be interesting to see statistics not by patch count, but by
patch complexity.  As a rough measure of complexity we could use the
number of affected lines.  I expect the statistics would be even more
distressing: small patches can be committed quickly, while large patches
wander from one CF to another for a long time.  It would be interesting
to check whether that's true...

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Commit fest 2017-11

2018-03-29 Thread Magnus Hagander
On Thu, Mar 29, 2018 at 10:06 AM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> Hi, Fabien!
>
> On Fri, Dec 1, 2017 at 9:10 AM, Fabien COELHO  wrote:
>
>> And the last 21 patches have been classified as well. Here is the
>>> final score for this time:
>>> Committed: 55.
>>> Moved to next CF: 103.
>>> Rejected: 1.
>>> Returned with Feedback: 47.
>>> Total: 206.
>>>
>>> Thanks to all the contributors for this session! The CF is now closed.
>>>
>>
>> Thanks for the CF management.
>>
>> Attached a small graph of the end status of patch at the end of each CF.
>>
>
> Thank you for the graph!
> It would be interesting to see statistics not by patches count, but by
> their complexity.
> For rough measure of complexity we can use number of affected lines.  I
> expect that
> statistics would be even more distressing: small patches can be committed
> faster,
> while large patches are traversing from one CF to another during long
> time.  Interesting
> to check whether it's true...
>
>
I think that's very hard to do given that we simply don't have the data
today. It's not that simple to analyze the patches in the archives, because
some are a single file, some are spread across multiple files, etc. I fear
that anything trying to work off that would actually make the stats even
more inaccurate. This is the pattern I've seen whenever I've tried that
before.

I wonder if we should consider adding a field to the CF app *specifically*
to track things like this. What I'm thinking is a field that's set (or at
least verified) by the person who flags a patch as committed with choices
like Trivial/Simple/Medium/Complex (trivial being things like typo fixes
etc, which today can hugely skew the stats).

Would people who update the CF app be willing to put in that effort
(however small it is, it has to be done consistently to be of any value) in
order to be able to track such statistics?

It would only help for the future of course, unless somebody wants to go
back and backfill existing patches with such information (which they might
be).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Commit fest 2017-11

2018-03-29 Thread Simon Riggs
On 29 March 2018 at 09:19, Magnus Hagander  wrote:
>
>
> On Thu, Mar 29, 2018 at 10:06 AM, Alexander Korotkov
>  wrote:
>>
>> Hi, Fabien!
>>
>> On Fri, Dec 1, 2017 at 9:10 AM, Fabien COELHO  wrote:

 And the last 21 patches have been classified as well. Here is the
 final score for this time:
 Committed: 55.
 Moved to next CF: 103.
 Rejected: 1.
 Returned with Feedback: 47.
 Total: 206.

 Thanks to all the contributors for this session! The CF is now closed.
>>>
>>>
>>> Thanks for the CF management.
>>>
>>> Attached a small graph of the end status of patch at the end of each CF.
>>
>>
>> Thank you for the graph!
>> It would be interesting to see statistics not by patches count, but by
>> their complexity.
>> For rough measure of complexity we can use number of affected lines.  I
>> expect that
>> statistics would be even more distressing: small patches can be committed
>> faster,
>> while large patches are traversing from one CF to another during long
>> time.  Interesting
>> to check whether it's true...
>>
>
> I think that's very hard to do given that we simply don't have the data
> today. It's not that simple to analyze the patches in the archives, because
> some are single file, some are spread across multiple etc. I fear that
> anything trying to work off that would actually make the stats even more
> inaccurate. This is the pattern I've seen whenever I've treid tha tbefore.
>
> I wonder if we should consider adding a field to the CF app *specifically*
> to track things like this. What I'm thinking is a field that's set (or at
> least verified) by the person who flags a patch as committed with choices
> like Trivial/Simple/Medium/Complex (trivial being things like typo fixes
> etc, which today can hugely skew the stats).
>
> Would people who update the CF app be willing to put in that effort (however
> small it is, it has to be done consistently to be of any value) in order to
> be able to track such statistics?
>
> It would only help for the future of course, unless somebody wants to go
> back and backfill existing patches with such information (which they might
> be).

The focus of this is on the Committers, which seems wrong.

I suggest someone does another analysis that shows how many patch
reviews have been conducted by patch authors, so we can highlight
people who are causing the problem yet not helping solve the problem.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Commit fest 2017-11

2018-03-29 Thread Alexander Korotkov
On Thu, Mar 29, 2018 at 11:19 AM, Magnus Hagander 
wrote:

> On Thu, Mar 29, 2018 at 10:06 AM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> Hi, Fabien!
>>
>> On Fri, Dec 1, 2017 at 9:10 AM, Fabien COELHO 
>> wrote:
>>
>>> And the last 21 patches have been classified as well. Here is the
 final score for this time:
 Committed: 55.
 Moved to next CF: 103.
 Rejected: 1.
 Returned with Feedback: 47.
 Total: 206.

 Thanks to all the contributors for this session! The CF is now closed.

>>>
>>> Thanks for the CF management.
>>>
>>> Attached a small graph of the end status of patch at the end of each CF.
>>>
>>
>> Thank you for the graph!
>> It would be interesting to see statistics not by patches count, but by
>> their complexity.
>> For rough measure of complexity we can use number of affected lines.  I
>> expect that
>> statistics would be even more distressing: small patches can be committed
>> faster,
>> while large patches are traversing from one CF to another during long
>> time.  Interesting
>> to check whether it's true...
>>
>>
> I think that's very hard to do given that we simply don't have the data
> today. It's not that simple to analyze the patches in the archives, because
> some are single file, some are spread across multiple etc. I fear that
> anything trying to work off that would actually make the stats even more
> inaccurate. This is the pattern I've seen whenever I've treid tha tbefore.
>
> I wonder if we should consider adding a field to the CF app *specifically*
> to track things like this. What I'm thinking is a field that's set (or at
> least verified) by the person who flags a patch as committed with choices
> like Trivial/Simple/Medium/Complex (trivial being things like typo fixes
> etc, which today can hugely skew the stats).
>
> Would people who update the CF app be willing to put in that effort
> (however small it is, it has to be done consistently to be of any value) in
> order to be able to track such statistics?
>
> It would only help for the future of course, unless somebody wants to go
> back and backfill existing patches with such information (which they might
> be).
>

We have http://commitfest.cputube.org/, which automatically applies patches
and runs regression tests.  This tool probably doesn't handle all the cases;
in particular, it doesn't work when a patchset in one thread depends on a
patchset in another thread.  But it would definitely be enough for
statistics.

I also think we should eventually integrate the functionality of
http://commitfest.cputube.org/ into our commitfest application.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Commit fest 2017-11

2018-03-29 Thread Michael Paquier
On Thu, Mar 29, 2018 at 09:38:28AM +0100, Simon Riggs wrote:
> I suggest someone does another analysis that shows how many patch
> reviews have been conducted by patch authors, so we can highlight
> people who are causing the problem yet not helping solve the problem.

This data is already partially available.  Just go to the bottom of the
CF app, then click on "Reports" -> "Author Stats".  This does not give a
trend of the patch difficulty though.
--
Michael


Re: Commit fest 2017-11

2018-03-29 Thread Magnus Hagander
On Thu, Mar 29, 2018 at 10:37 AM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Mar 29, 2018 at 11:19 AM, Magnus Hagander 
> wrote:
>
>> On Thu, Mar 29, 2018 at 10:06 AM, Alexander Korotkov <
>> a.korot...@postgrespro.ru> wrote:
>>
>>> Hi, Fabien!
>>>
>>> On Fri, Dec 1, 2017 at 9:10 AM, Fabien COELHO 
>>> wrote:
>>>
 And the last 21 patches have been classified as well. Here is the
> final score for this time:
> Committed: 55.
> Moved to next CF: 103.
> Rejected: 1.
> Returned with Feedback: 47.
> Total: 206.
>
> Thanks to all the contributors for this session! The CF is now closed.
>

 Thanks for the CF management.

 Attached a small graph of the end status of patch at the end of each CF.

>>>
>>> Thank you for the graph!
>>> It would be interesting to see statistics not by patches count, but by
>>> their complexity.
>>> For rough measure of complexity we can use number of affected lines.  I
>>> expect that
>>> statistics would be even more distressing: small patches can be
>>> committed faster,
>>> while large patches are traversing from one CF to another during long
>>> time.  Interesting
>>> to check whether it's true...
>>>
>>>
>> I think that's very hard to do given that we simply don't have the data
>> today. It's not that simple to analyze the patches in the archives, because
>> some are single file, some are spread across multiple etc. I fear that
>> anything trying to work off that would actually make the stats even more
>> inaccurate. This is the pattern I've seen whenever I've treid tha tbefore.
>>
>> I wonder if we should consider adding a field to the CF app
>> *specifically* to track things like this. What I'm thinking is a field
>> that's set (or at least verified) by the person who flags a patch as
>> committed with choices like Trivial/Simple/Medium/Complex (trivial being
>> things like typo fixes etc, which today can hugely skew the stats).
>>
>> Would people who update the CF app be willing to put in that effort
>> (however small it is, it has to be done consistently to be of any value) in
>> order to be able to track such statistics?
>>
>> It would only help for the future of course, unless somebody wants to go
>> back and backfill existing patches with such information (which they might
>> be).
>>
>
> We have http://commitfest.cputube.org/ which automatically applies
> patches and runs
> regression tests.  This tool is probably not handle all the cases.  In
> particular
> it doesn't work when patchset in one thread is depending on patchset in
> another
> thread.  But it would be definitely enough for statistics.
>
> I also think we should eventually integrate functionality of
> http://commitfest.cputube.org/
> into our commitfest application..
>
>
Yes, there is (stalled, but still) work in progress on that one. I think
that's a separate thing though.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Typo in shared_record_table_compare() commentary

2018-03-29 Thread Arthur Zakirov
Hello hackers,

While studying dshash I've found a little typo.  There is no
SharedTupleDescTableKey struct in the code; I think the comment should
refer to the SharedRecordTableKey struct.

The patch is attached.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/utils/cache/typcache.c 
b/src/backend/utils/cache/typcache.c
index 874d8cd1c9..e38fd16eb0 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -198,7 +198,7 @@ typedef struct SharedTypmodTableEntry
 } SharedTypmodTableEntry;
 
 /*
- * A comparator function for SharedTupleDescTableKey.
+ * A comparator function for SharedRecordTableKey.
  */
 static int
 shared_record_table_compare(const void *a, const void *b, size_t size,


Re: Commit fest 2017-11

2018-03-29 Thread Magnus Hagander
On Thu, Mar 29, 2018 at 10:38 AM, Simon Riggs  wrote:

> On 29 March 2018 at 09:19, Magnus Hagander  wrote:
> >
> >
> > On Thu, Mar 29, 2018 at 10:06 AM, Alexander Korotkov
> >  wrote:
> >>
> >> Hi, Fabien!
> >>
> >> On Fri, Dec 1, 2017 at 9:10 AM, Fabien COELHO 
> wrote:
> 
>  And the last 21 patches have been classified as well. Here is the
>  final score for this time:
>  Committed: 55.
>  Moved to next CF: 103.
>  Rejected: 1.
>  Returned with Feedback: 47.
>  Total: 206.
> 
>  Thanks to all the contributors for this session! The CF is now closed.
> >>>
> >>>
> >>> Thanks for the CF management.
> >>>
> >>> Attached a small graph of the end status of patch at the end of each
> CF.
> >>
> >>
> >> Thank you for the graph!
> >> It would be interesting to see statistics not by patches count, but by
> >> their complexity.
> >> For rough measure of complexity we can use number of affected lines.  I
> >> expect that
> >> statistics would be even more distressing: small patches can be
> committed
> >> faster,
> >> while large patches are traversing from one CF to another during long
> >> time.  Interesting
> >> to check whether it's true...
> >>
> >
> > I think that's very hard to do given that we simply don't have the data
> > today. It's not that simple to analyze the patches in the archives,
> because
> > some are single file, some are spread across multiple etc. I fear that
> > anything trying to work off that would actually make the stats even more
> > inaccurate. This is the pattern I've seen whenever I've treid tha
> tbefore.
> >
> > I wonder if we should consider adding a field to the CF app
> *specifically*
> > to track things like this. What I'm thinking is a field that's set (or at
> > least verified) by the person who flags a patch as committed with choices
> > like Trivial/Simple/Medium/Complex (trivial being things like typo fixes
> > etc, which today can hugely skew the stats).
> >
> > Would people who update the CF app be willing to put in that effort
> (however
> > small it is, it has to be done consistently to be of any value) in order
> to
> > be able to track such statistics?
> >
> > It would only help for the future of course, unless somebody wants to go
> > back and backfill existing patches with such information (which they
> might
> > be).
>
> The focus of this is on the Committers, which seems wrong.
>
> I suggest someone does another analysis that shows how many patch
> reviews have been conducted by patch authors, so we can highlight
> people who are causing the problem yet not helping solve the problem.
>

I have exactly such an analysis available. The problem with it is that it
cannot take complexity into account.

Reviewing a one-letter typo patch is not the same as reviewing MERGE, JIT
or parallel query... But right now, we don't have enough proper data to
differentiate those.

The idea would not be to put the focus on the committer necessarily, but on
the person who marks the patch as committed in the CF app. We can of course
have the submitter also input this as metadata, but in the end it's going
to be the reviewers and committers who are the only ones who can judge
what the *actual* complexity of a patch is/was.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] A design for amcheck heapam verification

2018-03-29 Thread Simon Riggs
On 29 March 2018 at 01:49, Peter Geoghegan  wrote:

>>> IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
>>> ShareUpdateExclusiveLock (it just says SharedLock), because that lock
>>> strength doesn't have anything to do with IndexBuildHeapRangeScan()
>>> when it operates with an MVCC snapshot. I think that this means that
>>> this patch doesn't need to update comments within
>>> IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
>>> independent.
>>
>>
>> Ok, I agree. But note that we are now invoking that code with
>> AccessShareLock() whereas the existing callers either use ShareLock or
>> ShareUpdateExclusiveLock. That's probably does not matter, but it's a change
>> worth noting.
>
> Fair point, even though the ShareUpdateExclusiveLock case isn't
> actually acknowledged by IndexBuildHeapRangeScan(). Can we leave this
> one up to the committer, too? I find it very hard to argue either for
> or against this, and I want to avoid "analysis paralysis" at this
> important time.

The above discussion doesn't make sense to me; I hope someone can explain.

I understand we are adding a check to verify heap against index and
this will take longer than before. When it runs does it report
progress of the run via pg_stat_activity, so we can monitor how long
it will take?

Locking is also an important concern.

If we need a ShareLock to run the extended check and the check runs
for a long time, when would we decide to run that? This sounds like it
will cause a production outage, so what are the pre-conditions that
would lead us to say "we'd better run this"?  For example, if running
this is known to be significantly faster than running CREATE INDEX,
that might be an argument for someone to run this first if index
corruption is suspected.

If it detects an issue, can it fix the issue for the index by
injecting correct entries? If not then we will have to run CREATE
INDEX afterwards anyway, which makes it more likely that people would
just run CREATE INDEX and not bother with the check.

So my initial questions are about when we would run this and making
sure that is documented.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel safety of binary_upgrade_create_empty_extension

2018-03-29 Thread Thomas Munro
On Wed, Mar 28, 2018 at 6:18 AM, Robert Haas  wrote:
> On Tue, Mar 27, 2018 at 11:17 AM, Tom Lane  wrote:
>> regression=# create sequence s1;
>> CREATE SEQUENCE
>> regression=# begin;
>> BEGIN
>> regression=# set force_parallel_mode to 1;
>> SET
>> regression=# declare c cursor for select nextval('s1');
>> DECLARE CURSOR
>> regression=# select cursor_to_xml('c',1,true,true,'');
>> ERROR:  cannot execute nextval() during a parallel operation
>>
>> I think this behavior is a bug.
>
> I agree.

Presumably also cursor_to_xmlschema.  I also found some more
suspicious brin and gin modifying functions.  So I think the list of
functions that needs to be marked 'u' so far is:

* binary_upgrade_create_empty_extension
* pg_import_system_collations
* brin_summarize_range
* brin_summarize_new_values
* brin_desummarize_range
* gin_clean_pending_list
* cursor_to_xml
* cursor_to_xmlschema
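
For anyone who wants to double-check, the current markings can be listed
with a query along these lines (just an illustrative sketch):

SELECT proname, proparallel
  FROM pg_proc
 WHERE proname IN ('binary_upgrade_create_empty_extension',
                   'pg_import_system_collations',
                   'brin_summarize_range',
                   'brin_summarize_new_values',
                   'brin_desummarize_range',
                   'gin_clean_pending_list',
                   'cursor_to_xml',
                   'cursor_to_xmlschema');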

Has anyone got anything else?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Typo in shared_record_table_compare() commentary

2018-03-29 Thread Thomas Munro
On Thu, Mar 29, 2018 at 10:16 PM, Arthur Zakirov
 wrote:
> While studying dshash I've found a little typo.  There is no
> SharedTupleDescTableKey struct in the code; I think the comment should
> refer to the SharedRecordTableKey struct.

Right, thanks!

> The patch is attached.

+1

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Typo in shared_record_table_compare() commentary

2018-03-29 Thread Magnus Hagander
On Thu, Mar 29, 2018 at 11:37 AM, Thomas Munro <
thomas.mu...@enterprisedb.com> wrote:

> On Thu, Mar 29, 2018 at 10:16 PM, Arthur Zakirov
>  wrote:
> > While studying dshash I've found a little typo.  There is no
> > SharedTupleDescTableKey struct in the code; I think the comment should
> > refer to the SharedRecordTableKey struct.
>
> Right, thanks!
>
> > The patch is attached.
>
> +1
>

Applied, thanks.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Commit fest 2017-11

2018-03-29 Thread Sergei Kornilov
Hello
I cannot find "Reports" at the bottom of any page of the CF app...

Do such stats cover only reviews marked in the CF app, through the
"Comment" -> "Review" buttons?  I'm afraid these statistics will be
inaccurate for new users (like me).  The wiki page
https://wiki.postgresql.org/wiki/Reviewing_a_Patch says
> Send reviews as replies to the email containing the patch
and says nothing about making changes in the CF app.

regards, Sergei



Re: Commit fest 2017-11

2018-03-29 Thread Magnus Hagander
On Thu, Mar 29, 2018 at 11:47 AM, Sergei Kornilov  wrote:

> Hello
> I can not find "Reports" in bottom any page of CF app...
>
> Such stats covers only reviews marked in CF app? Through
> "Comment"->"Review" buttons? I'm afraid this statistics will be inaccurate
> for new users (like me). Wiki page https://wiki.postgresql.org/
> wiki/Reviewing_a_Patch say
> > Send reviews as replies to the email containing the patch
> And nothing about changes in CF app.
>

That report is currently only available to CF managers. One of the main
reasons for that is that the underlying data is very uncertain and we don't
want people to draw "random" conclusions from it.

And yes, they only cover things marked in the CF app.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-29 Thread Simon Riggs
On 28 March 2018 at 12:00, Pavan Deolasee  wrote:

> v27 attached, though review changes are in
> the add-on 0005 patch.

This all looks good now, thanks for making all of those changes.

I propose [v27 patch1+patch3+patch5] as the initial commit candidate
for MERGE, with other patches following later before the end of the CF.

I propose to commit this tomorrow, 30 March, about 26 hours from now.
That will allow some time for buildfarm fixing/reversion before the
Easter weekend, then other patches to follow starting 2 April. That
then gives reasonable time to follow up on other issues that we will
no doubt discover fairly soon after commit, such as additional runs by
SQLsmith and more eyeballs.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-03-29 Thread Rushabh Lathia
Hi,

Consider the below test:

CREATE TABLE foo (a INT, b INT, c VARCHAR) PARTITION BY LIST(a);
CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES IN (1,2);
CREATE TABLE foo_p2 PARTITION OF foo FOR VALUES IN (3,4);
INSERT INTO foo select i,i,i from generate_series(1,4)i;

CREATE TABLE foo_d (like foo);
INSERT INTO foo_d select i,i,i from generate_series(1,9)i;

ALTER TABLE foo ATTACH PARTITION foo_d DEFAULT;

The ATTACH PARTITION above should fail with a "partition constraint is
violated" error, but instead it succeeds on the master branch.

Looking further, I found that the problem was introduced by the commit
below:

commit 4dba331cb3dc1b5ffb0680ed8efae847de216796
Author: Alvaro Herrera 
Date:   Tue Mar 20 11:19:41 2018 -0300

Fix CommandCounterIncrement in partition-related DDL

It makes sense to do the CCIs in the places that do catalog updates,
rather than before the places that error out because the former ones
fail to do it.  In particular, it looks like StorePartitionBound() and
IndexSetParentIndex() ought to make their own CCIs.

Per review comments from Peter Eisentraut for row-level triggers on
partitioned tables.

I noticed that the commit below later tried to correct things in a
similar area, but I still see the broken behavior when attaching the
partition as DEFAULT.

commit 56163004b8b2151db279744b77138d4d90e2d5cb
Author: Alvaro Herrera 
Date:   Wed Mar 21 12:03:35 2018 -0300

Fix relcache handling of the 'default' partition

Regards,


-- 
Rushabh Lathia


Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-29 Thread Teodor Sigaev
The next question I see: why do we not lock entry leaf pages in some cases?


I've modified the patch to predicate lock each leaf (entry or posting) page.
Now the patch reaches a committable state from my point of view.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 37070b3b72..095b1192cb 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/xloginsert.h"
+#include "storage/predicate.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -515,6 +516,19 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			btree->fillRoot(btree, newrootpg,
 			BufferGetBlockNumber(lbuffer), newlpage,
 			BufferGetBlockNumber(rbuffer), newrpage);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(lbuffer));
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
+
 		}
 		else
 		{
@@ -524,6 +538,14 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			GinPageGetOpaque(newrpage)->rightlink = savedRightLink;
 			GinPageGetOpaque(newlpage)->flags |= GIN_INCOMPLETE_SPLIT;
 			GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
 		}
 
 		/*
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index f9daaba52e..3fb4fc8264 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -19,6 +19,7 @@
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/rel.h"
 
 /*
@@ -1423,7 +1424,7 @@ disassembleLeaf(Page page)
  * Any segments that acquire new items are decoded, and the new items are
  * merged with the old items.
  *
- * Returns true if any new items were added. False means they were all
+ * Returns true if any new items were added. false means they were all
  * duplicates of existing items on the page.
  */
 static bool
@@ -1759,7 +1760,7 @@ leafRepackItems(disassembledLeaf *leaf, ItemPointer remaining)
  */
 BlockNumber
 createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
-  GinStatsData *buildStats)
+  GinStatsData *buildStats, Buffer entrybuffer)
 {
 	BlockNumber blkno;
 	Buffer		buffer;
@@ -1810,6 +1811,12 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	page = BufferGetPage(buffer);
 	blkno = BufferGetBlockNumber(buffer);
 
+	/*
+	 * Copy a predicate lock from entry tree leaf (containing posting list)
+	 * to  posting tree.
+	 */
+	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
@@ -1904,6 +1911,7 @@ ginInsertItemPointers(Relation index, BlockNumber rootBlkno,
 		btree.itemptr = insertdata.items[insertdata.curitem];
 		stack = ginFindLeafPage(&btree, false, NULL);
 
+		GinCheckForSerializableConflictIn(btree.index, NULL, stack->buffer);
 		ginInsertValue(&btree, stack, &insertdata, buildStats);
 	}
 }
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6fe67f346d..21f2cbac3b 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -17,8 +17,10 @@
 #include "access/gin_private.h"
 #include "access/relscan.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
 
 /* GUC parameter */
 int			GinFuzzySearchLimit = 0;
@@ -33,11 +35,18 @@ typedef struct pendingPosition
 } pendingPosition;
 
 
+static void
+GinPredicateLockPage(Relation index, BlockNumber blkno, Snapshot snapshot)
+{
+	if (!GinGetUseFastUpdate(index))
+			PredicateLockPage(index, blkno, snapshot);
+}
+
 /*
  * Goes to the next page if current offset is outside of bounds
  */
 static bool
-moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack)
+moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack, Snapshot snapshot)
 {
 	Page		page = BufferGetPage(stack->buffer);
 
@@ -52,6 +61,7 @@ moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack)
 		stack->buffer = ginStepRight(stack->buffer, btree->index, GIN_SHARE);
 		stack->blkno = BufferGetBlockNumber(stack->buffer);
 		stack->off = FirstOffsetNumber;
+		GinPredicateLockPage(btree->index, stack->blkno, snapshot);
 	}
 
 	return true;
@@ -73,6 +83,7 @@ scanPostingTree(Relation index, GinScanEnt

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-29 Thread Ashutosh Bapat
On Wed, Mar 28, 2018 at 7:21 PM, Ashutosh Bapat
 wrote:
>
> Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
> are basically the conditions on the relation being pushed down.
> havingQuals are conditions on a grouped relation so treating them like
> baserestrictinfo or join conditions looks more straight forward,
> rather than having a separate member in PgFdwRelationInfo. So, remote
> havingQuals go into remote_conds and local havingQuals go to
> local_conds.

Looks like we already do that.  Then we have remote_conds and local_conds,
which together should be equivalent to havingQual.  Storing all three
doesn't make sense.  In the future someone may use havingQual instead of
remote_conds/local_conds just because it's available, and then there is a
risk of these three lists going out of sync.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Proposal: http2 wire format

2018-03-29 Thread Hannu Krosing
>
> > * room for other resultset formats later. Like Damir, I really want to
> add
> > protobuf or json serializations of result sets at some point, mainly so
> we
> > can return "entity graphs" in graph representation rather than left-join
> > projection.
>
> -1. I don't think this belongs in postgres.
>

Maybe the functionality does not belong in *core* postgres, but I sure
would like it to be possible for an extension to do it.

A bit like what logical decoding plugins do now, just more flexible in
terms of protocol.

Cheers
Hannu


Re: [HACKERS] path toward faster partition pruning

2018-03-29 Thread David Rowley
On 29 March 2018 at 21:35, Amit Langote  wrote:
> Beside fixing that, I have decided to get rid of the
> PartititionPruneStepOpNe (a special kind of base pruning step that was
> being used to prune list partitions using a set of <> operator clauses)
> and related functions.  Instead pruning for <> operator clauses is now
> implemented by using a combination of PartitionPruneStepOp and
> PartitionPruneStepCombine after adding a new combine op COMBINE_INVERT (I
> also renamed COMBINE_OR and COMBINE_AND to COMBINE_UNION and
> COMBINE_INTERSECT, respectively).  I decided to do so because the previous
> arrangement looked like a "hack" to support a special case that touched no
> less than quite a few places.

Hi Amit,

I've looked at the v44 patch. Thanks for making those changes.

The new not-equal handling code is not quite right.

DROP TABLE listp;
CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);
CREATE TABLE listp_default PARTITION OF listp DEFAULT;

EXPLAIN SELECT * FROM listp WHERE a <> 1;
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..54.56 rows=2537 width=4)
   ->  Seq Scan on listp1_3  (cost=0.00..41.88 rows=2537 width=4)
         Filter: (a <> 1)
(3 rows)

The default should be included here.

INSERT INTO listp VALUES(1),(2),(3);
SELECT * FROM listp WHERE a <> 1;
 a
---
 3
(1 row)

This code assumes it's fine to just reverse the setting for default:

result->scan_default = !source->scan_default;

More complex handling is needed here.

I've attached a diff for a small set of other things I noticed while reviewing.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v44_drowley_review.patch
Description: Binary data


hot_standby_feedback vs excludeVacuum and snapshots

2018-03-29 Thread Greg Stark
I'm poking around trying to debug a vacuuming problem and wondering if
I've found something more serious.

As far as I can tell the snapshots on HOT standby are built using a
list of running xids that the primary builds and puts in the WAL and
that seems to include all xids from transactions running in all
databases. The HOT standby would then build a snapshot and eventually
send the xmin of that snapshot back to the primary in the hot standby
feedback and that would block vacuuming tuples that might be visible
to the standby.
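
(As an aside, the xmin a standby feeds back is visible on the primary as
pg_stat_replication.backend_xmin, e.g.:

SELECT application_name, backend_xmin FROM pg_stat_replication;

so it is easy to watch what the standby is actually reporting.)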

Many ages ago Alvaro sweated blood to ensure vacuums could run for
long periods of time without holding back the xmin horizon and
blocking other vacuums from cleaning up tuples. That's the purpose of
the excludeVacuum flag in GetCurrentVirtualXIDs(). That's possible
because we know vacuums won't insert any tuples that queries might try
to view and also vacuums won't try to perform any sql queries on other
tables.

I can't find anywhere that the standby snapshot building mechanism
gets this same information about which xids are actually vacuums that
can be ignored when building a snapshot. So I'm concerned that the hot
standby sending back its xmin would be effectively undermining this
mechanism and forcing vacuum xids to be included in the xmin horizon
and prevent vacuuming of tuples.

Am I missing something obvious? Is this a known problem?

-- 
greg



Fix src/test/subscription/t/003_constraints.pl header comment

2018-03-29 Thread Alexander Korotkov
Hi!

src/test/subscription/t/003_constraints.pl starts with the line

# Basic logical replication test

It seems that this line was copy-pasted from
src/test/subscription/t/001_rep_changes.pl.
A fix is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


fix_003_constraints_header_comment.patch
Description: Binary data


Re: Prefix operator for text and spgist support

2018-03-29 Thread Ildus Kurbangaliev
On Fri, 23 Mar 2018 11:45:33 +0300
Alexander Korotkov  wrote:

> On Thu, Mar 22, 2018 at 7:09 PM, Teodor Sigaev 
> wrote:
> 
> > Patch looks resonable, but I see some place to improvement:
> > spg_text_leaf_consistent() only needs to check with
> > text_startswith() if reconstucted value came to leaf consistent is
> > shorter than given prefix. For example, if level >= length of
> > prefix then we guarantee that fully reconstracted is matched too.
> > But do not miss that you may need to return value for index only
> > scan, consult returnData field
> >
> > In attachment rebased and minorly edited version of your patch.  
> 
> 
> I took a look at this patch.  In addition to Teodor's comments I can
> note following.
> 
> * This line looks strange for me.
> 
> - if (strategy > 10)
> + if (strategy > 10 && strategy !=
> RTPrefixStrategyNumber)
> 
> It's not because we added strategy != RTPrefixStrategyNumber condition
> there.
> It's because we already used magic number here and now have a mix of
> magic number and macro constant in one line.  Once we anyway touch
> this place, could we get rid of magic numbers here?
> 
> * I'm a little concern about operator name.  We're going to choose @^
> operator for
> prefix search without any preliminary discussion.  However,
> personally I don't
> have better ideas :)

Teodor, Alexander, thanks for the review.  In the new version I have added
the optimization in spgist using the level variable and also got rid of the
magic numbers.

About the operator: it's actually ^@ (not @^ :)).  I thought about it and
don't really have a better idea for what operator could be used instead.

Attached is version 5 of the patch.
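
For illustration, based on the examples in the documentation part of the
patch (assuming it applies as posted):

SELECT 'alphabet' ^@ 'alph';                  -- t
SELECT text_startswith('alphabet', 'alph');   -- t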

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9d1772f349..a1b4724a88 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -2274,6 +2274,21 @@
ph
   
 
+  
+   
+
+ text_startswith
+
+text_startswith(string, prefix)
+   
+   bool
+   
+Returns true if string starts with prefix.
+   
+   text_startswith('alphabet', 'alph')
+   t
+  
+
   

 
@@ -4033,6 +4048,12 @@ cast(-44 as bit(12))   11010100
 ILIKE, respectively.  All of these operators are
 PostgreSQL-specific.

+
+   
+There is also the prefix operator ^@ and corresponding
+text_startswith function which covers cases when only
+searching by beginning of the string is needed.
+   
   
 
 
diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml
index e47f70be89..06b7519052 100644
--- a/doc/src/sgml/spgist.sgml
+++ b/doc/src/sgml/spgist.sgml
@@ -161,6 +161,7 @@
~<~
~>=~
~>~
+   ^@
   
  
  
diff --git a/src/backend/access/spgist/spgtextproc.c b/src/backend/access/spgist/spgtextproc.c
index f156b2166e..c5e27dd7aa 100644
--- a/src/backend/access/spgist/spgtextproc.c
+++ b/src/backend/access/spgist/spgtextproc.c
@@ -67,6 +67,20 @@
  */
 #define SPGIST_MAX_PREFIX_LENGTH	Max((int) (BLCKSZ - 258 * 16 - 100), 32)
 
+/*
+ * Strategy for collation aware operator on text is equal to btree strategy
+ * plus value of 10.
+ *
+ * Current collation aware strategies and their corresponding btree strategies:
+ * 11 BTLessStrategyNumber
+ * 12 BTLessEqualStrategyNumber
+ * 14 BTGreaterEqualStrategyNumber
+ * 15 BTGreaterStrategyNumber
+ */
+#define SPG_STRATEGY_ADDITION	(10)
+#define SPG_IS_COLLATION_AWARE_STRATEGY(s) ((s) > SPG_STRATEGY_ADDITION \
+		 && (s) != RTPrefixStrategyNumber)
+
 /* Struct for sorting values in picksplit */
 typedef struct spgNodePtr
 {
@@ -496,10 +510,10 @@ spg_text_inner_consistent(PG_FUNCTION_ARGS)
 			 * well end with a partial multibyte character, so that applying
 			 * any encoding-sensitive test to it would be risky anyhow.)
 			 */
-			if (strategy > 10)
+			if (SPG_IS_COLLATION_AWARE_STRATEGY(strategy))
 			{
 if (collate_is_c)
-	strategy -= 10;
+	strategy -= SPG_STRATEGY_ADDITION;
 else
 	continue;
 			}
@@ -526,6 +540,10 @@ spg_text_inner_consistent(PG_FUNCTION_ARGS)
 	if (r < 0)
 		res = false;
 	break;
+case RTPrefixStrategyNumber:
+	if (r != 0)
+		res = false;
+	break;
 default:
 	elog(ERROR, "unrecognized strategy number: %d",
 		 in->scankeys[j].sk_strategy);
@@ -605,10 +623,31 @@ spg_text_leaf_consistent(PG_FUNCTION_ARGS)
 		int			queryLen = VARSIZE_ANY_EXHDR(query);
 		int			r;
 
-		if (strategy > 10)
+		if (strategy == RTPrefixStrategyNumber)
+		{
+			if (!in->returnData && level >= queryLen)
+			{
+/*
+ * If we got to the leaf when level is equal or longer than
+ * query then it is a prefix. We skip cases when returnData is
+ * true, it would mean then leaf could be visited anyway.
+ */
+continue;
+			}
+
+			res = DatumGetBool(DirectFunctionCall2(text_sta

Re: committing inside cursor loop

2018-03-29 Thread Andrew Gierth
> "Peter" == Peter Eisentraut  writes:

 Peter> Committed, thanks!

So... what is a pl/* that does _not_ use pinned cursors for cursor loops
supposed to do in this case?

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-03-29 Thread Teodor Sigaev
The concept of the max-retry option seems strange to me.  If the number of
retries reaches the max-retry limit, then we just increment the counter of
failed transactions and try again (possibly with different random numbers).
At the end we should distinguish the number of errored transactions from the
number of failed transactions, and to find this difference the documentation
suggests rerunning pgbench with debugging on.


Maybe I didn't catch the idea, but it seems to me max-tries should be removed.
On a transaction serialization or deadlock error, pgbench should increment the
counter of failed transactions, reset the conditional stack, variables, etc.
(but not the random generator) and then start a new transaction from the first
line of the script.


Marina Polyakova wrote:

On 26-03-2018 18:53, Fabien COELHO wrote:

Hello Marina,


Hello!


Many thanks to both of you! I'm working on a patch in this direction..


I think that the best approach for now is simply to reset (command
zero, random generator) and start over the whole script, without
attempting to be more intelligent. The limitations should be clearly
documented (one transaction per script), though. That would be a
significant enhancement already.


I'm not sure that we can always do this, because we can get new errors until 
we finish the failed transaction block, and we need destroy the conditional 
stack..


Sure. I'm suggesting so as to simplify that on failures the retry
would always restarts from the beginning of the script by resetting
everything, indeed including the conditional stack, the random
generator state, the variable values, and so on.

This mean enforcing somehow one script is one transaction.

If the user does not do that, it would be their decision and the
result becomes unpredictable on errors (eg some sub-transactions could
be executed more than once).

Then if more is needed, that could be for another patch.


Here is the fifth version of the patch for pgbench (based on the commit 
4b9094eb6e14dfdbed61278ea8e51cc846e43579) where I tried to implement these 
ideas, thanks to your comments and those of Teodor Sigaev. Since we may need to 
execute commands to complete a failed transaction block, the script is now 
always executed completely. If there is a serialization/deadlock failure which 
can be retried, the script is executed again with the same random state and 
array of variables as before its first run. Meta commands  errors as well as all 
SQL errors do not cause the aborting of the client. The first failure in the 
current script execution determines whether the script run will be retried or 
not, so only such failures (they have a retry) or errors (they are not retried) 
are reported.


I tried to make fixes in accordance with your previous reviews ([1], [2], [3]):


I'm unclear about the added example added in the documentation. There
are 71% errors, but 100% of transactions are reported as processed. If
there were errors, then it is not a success, so the transaction were
not
processed? To me it looks inconsistent. Also, while testing, it seems
that
failed transactions are counted in tps, which I think is not
appropriate:


About the feature:

 sh> PGOPTIONS='-c default_transaction_isolation=serializable' \
   ./pgbench -P 1 -T 3 -r -M prepared -j 2 -c 4
 starting vacuum...end.
 progress: 1.0 s, 10845.8 tps, lat 0.091 ms stddev 0.491, 10474 failed
 # NOT 10845.8 TPS...
 progress: 2.0 s, 10534.6 tps, lat 0.094 ms stddev 0.658, 10203 failed
 progress: 3.0 s, 10643.4 tps, lat 0.096 ms stddev 0.568, 10290 failed
 ...
 number of transactions actually processed: 32028 # NO!
 number of errors: 30969 (96.694 %)
 latency average = 2.833 ms
 latency stddev = 1.508 ms
 tps = 10666.720870 (including connections establishing) # NO
 tps = 10683.034369 (excluding connections establishing) # NO
 ...

For me this is all wrong. I think that the tps report is about
transactions
that succeeded, not mere attempts. I cannot say that a transaction
which aborted
was "actually processed"... as it was not.


Fixed


The order of reported elements is not logical:

 maximum number of transaction tries: 100
 scaling factor: 10
 query mode: prepared
 number of clients: 4
 number of threads: 2
 duration: 3 s
 number of transactions actually processed: 967
 number of errors: 152 (15.719 %)
 latency average = 9.630 ms
 latency stddev = 13.366 ms
 number of transactions retried: 623 (64.426 %)
 number of retries: 32272

I would suggest to group everything about error handling in one block,
eg something like:

 scaling factor: 10
 query mode: prepared
 number of clients: 4
 number of threads: 2
 duration: 3 s
 number of transactions actually processed: 967
 number of errors: 152 (15.719 %)
 number of transactions retried: 623 (64.426 %)
 number of retries: 32272
 maximum number of transaction tries: 100
 latency average = 9.630 ms
 latency stddev = 13.366 ms


Fixed


Also, percent character should be stuck to its number: 15.719% to have
the style more homogeneous (although there seems to be pre-existing

Re: csv format for psql

2018-03-29 Thread Daniel Verite
David G. Johnston wrote:

> Could someone post how captions, rows-only, and footer pset settings factor
> into this?  Specifically are they fixed to on/off or will they hide/show if
> users request them explicitly?

This is described in the doc with:

+  csv format writes columns separated
+  by fieldsep, applying the CSV quoting rules
+  described in RFC-4180 and compatible with the CSV format
+  of the COPY command.
+  The header with column names is output unless the
+  tuples_only parameter is on.
+  Title and footers are not printed.
+  
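
So, concretely (a sketch, assuming the format is selected with
\pset format csv as in the proposed patch):

\pset format csv
-- header row with the column names, then the data rows
SELECT 1 AS a, 2 AS b;
\pset tuples_only on
-- data rows only; the title and footer are never printed in csv format
SELECT 1 AS a, 2 AS b;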


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-29 Thread Alvaro Herrera
I don't quite understand the new call in gininsert -- I mean I see that
it wants to check for conflicts even when fastupdate is set, but why?
Maybe, just maybe, it would be better to add a new flag to the
GinCheckForSerializableConflictIn function, that's passed differently
for this one callsite, and then a comment next to it that indicates why
we test for fastupdate in one case and not the other.
If you don't like this idea, then I think more commentary on why
fastupdate is not considered in gininsert is warranted.

In startScanEntry, if you "goto restartScanEntry" in the fastupdate
case, are you trying to acquire the same lock again?  Maybe the lock
acquire should occur before the goto target? (If this doesn't matter for
some reason, maybe add a comment about it)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-29 Thread Thomas Munro
On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer  wrote:
> On 28 March 2018 at 11:53, Tom Lane  wrote:
>>
>> Craig Ringer  writes:
>> > TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
> well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."

The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed.  That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.

Yeah, I see why you want to PANIC.

>> Moreover, POSIX is entirely clear that successful fsync means all
>> preceding writes for the file have been completed, full stop, doesn't
>> matter when they were issued.
>
> I can't find anything that says so to me. Please quote relevant spec.
>
> I'm working from
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
> states that
>
> "The fsync() function shall request that all data for the open file
> descriptor named by fildes is to be transferred to the storage device
> associated with the file described by fildes. The nature of the transfer is
> implementation-defined. The fsync() function shall not return until the
> system has completed that action or until an error is detected."
>
> My reading is that POSIX does not specify what happens AFTER an error is
> detected. It doesn't say that error has to be persistent and that subsequent
> calls must also report the error. It also says:

FWIW my reading is the same as Tom's.  It says "all data for the open
file descriptor" without qualification or special treatment after
errors.  Not "some".

> I'm not seeking to defend what the kernel seems to be doing. Rather, saying
> that we might see similar behaviour on other platforms, crazy or not. I
> haven't looked past linux yet, though.

I see no reason to think that any other operating system would behave
that way without strong evidence...  This is openly acknowledged to be
"a mess" and "a surprise" in the Filesystem Summit article.  I am not
really qualified to comment, but from a cursory glance at FreeBSD's
vfs_bio.c I think it's doing what you'd hope for... see the code near
the comment "Failed write, redirty."

-- 
Thomas Munro
http://www.enterprisedb.com



Re: csv format for psql

2018-03-29 Thread Daniel Verite
David G. Johnston wrote:

> Or, really, just make --csv take an optional argument which, if present, is
> the delimiter.

I don't think psql can support optional arguments because
psql --option foo
would be ambiguous, since foo could be the option's value or
the name of a database to connect to.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: csv format for psql

2018-03-29 Thread Pavel Stehule
2018-03-29 14:17 GMT+02:00 Daniel Verite :

> David G. Johnston wrote:
>
> > Or, really, just make --csv take an optional argument which, if present,
> is
> > the delimiter.
>
> I don't think psql can support optional arguments because
> psql --option foo
> would be ambiguous, since foo could be the option's value or
> the name of a database to connect to.
>

sure

Pavel


>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: JIT compiling with LLVM v12

2018-03-29 Thread Jesper Pedersen

Hi Andres,

On 03/28/2018 05:27 PM, Andres Freund wrote:

On 2018-03-27 10:34:26 -0700, Andres Freund wrote:

On 2018-03-27 10:05:47 -0400, Peter Eisentraut wrote:

On 3/13/18 19:40, Andres Freund wrote:

I've pushed a revised and rebased version of my JIT patchset.


What is the status of this item as far as the commitfest is concerned?


7/10 committed. Inlining, Explain, Docs remain.


I've pushed these three.



It seems that clang is being picked up as the main compiler in certain
situations, e.g.:


ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -g -O0 -fno-omit-frame-pointer 
-I../../../src/include  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o 
auth-scram.o auth-scram.c -MMD -MP -MF .deps/auth-scram.Po
ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -g -O0 -fno-omit-frame-pointer 
-I../../../src/include  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o 
be-secure-openssl.o be-secure-openssl.c -MMD -MP -MF 
.deps/be-secure-openssl.Po
/usr/lib64/ccache/clang -Wno-ignored-attributes -fno-strict-aliasing 
-fwrapv -O2  -I../../../src/include  -D_GNU_SOURCE 
-I/usr/include/libxml2  -flto=thin -emit-llvm -c -o be-fsstubs.bc 
be-fsstubs.c
/usr/lib64/ccache/clang -Wno-ignored-attributes -fno-strict-aliasing 
-fwrapv -O2  -I../../../src/include  -D_GNU_SOURCE 
-I/usr/include/libxml2  -flto=thin -emit-llvm -c -o namespace.bc namespace.c


I would expect LLVM to be isolated to the jit/ hierarchy.

Using CC="ccache gcc" and --with-llvm.

And congrats on getting the feature in !

Best regards,
 Jesper



Re: csv format for psql

2018-03-29 Thread Daniel Verite
Isaac Morland wrote:

> The actual reason I'm posting this is because some of the discussion has
> made me a bit confused: there is already a CSV format defined for the COPY
> command and used by the psql \copy. I just want to check that what is being
> discussed here would use the exact same format as the existing CSV COPY

Please see the start of the thread at
https://www.postgresql.org/message-id/a8de371e-006f-4f92-ab72-2bbe3ee78f03%40manitou-mail.org
where what is proposed is compared to what \copy and COPY
already provide.

About the CSV acronym itself, the COPY documentation
https://www.postgresql.org/docs/current/static/sql-copy.html
already refers to it as "Comma Separated Values", even though
as we know, the delimiter is user-configurable.
There's no difference in psql that would justify a different wording.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: JIT compiling with LLVM v12.2

2018-03-29 Thread John Naylor
Hi Andres,
I spent some time poring over the JIT README, and I've attached
a patch with some additional corrections as well as some stylistic
suggestions. The latter may be debatable, but I'm sure you can take
and pick as you see fit. If there are cases where I misunderstood your
intent, maybe that's also useful information. :-)

-John Naylor
diff --git a/src/backend/jit/README b/src/backend/jit/README
index bfed319..7924127 100644
--- a/src/backend/jit/README
+++ b/src/backend/jit/README
@@ -13,12 +13,12 @@ the CPU that just handles that expression, yielding a speedup.
 That this is done at query execution time, possibly even only in cases
 the relevant task is done a number of times, makes it JIT, rather than
 ahead-of-time (AOT). Given the way JIT compilation is used in
-postgres, the lines between interpretation, AOT and JIT are somewhat
+PostgreSQL, the lines between interpretation, AOT and JIT are somewhat
 blurry.
 
 Note that the interpreted program turned into a native program does
 not necessarily have to be a program in the classical sense. E.g. it
-is highly beneficial JIT compile tuple deforming into a native
+is highly beneficial to JIT compile tuple deforming into a native
 function just handling a specific type of table, despite tuple
 deforming not commonly being understood as a "program".
 
@@ -26,7 +26,7 @@ deforming not commonly being understood as a "program".
 Why JIT?
 
 
-Parts of postgres are commonly bottlenecked by comparatively small
+Parts of PostgreSQL are commonly bottlenecked by comparatively small
 pieces of CPU intensive code. In a number of cases that is because the
 relevant code has to be very generic (e.g. handling arbitrary SQL
 level expressions, over arbitrary tables, with arbitrary extensions
@@ -49,11 +49,11 @@ particularly beneficial for removing branches during tuple deforming.
 How to JIT
 ==
 
-Postgres, by default, uses LLVM to perform JIT. LLVM was chosen
+PostgreSQL, by default, uses LLVM to perform JIT. LLVM was chosen
 because it is developed by several large corporations and therefore
 unlikely to be discontinued, because it has a license compatible with
-PostgreSQL, and because its LLVM IR can be generated from C
-using the clang compiler.
+PostgreSQL, and because its IR can be generated from C using the Clang
+compiler.
 
 
 Shared Library Separation
@@ -68,20 +68,20 @@ An additional benefit of doing so is that it is relatively easy to
 evaluate JIT compilation that does not use LLVM, by changing out the
 shared library used to provide JIT compilation.
 
-To achieve this code, e.g. expression evaluation, intending to perform
-JIT, calls a LLVM independent wrapper located in jit.c to do so. If
-the shared library providing JIT support can be loaded (i.e. postgres
-was compiled with LLVM support and the shared library is installed),
-the task of JIT compiling an expression gets handed of to shared
-library. This obviously requires that the function in jit.c is allowed
-to fail in case no JIT provider can be loaded.
+To achieve this, code intending to perform JIT (e.g. expression evaluation)
+calls an LLVM independent wrapper located in jit.c to do so. If the
+shared library providing JIT support can be loaded (i.e. PostgreSQL was
+compiled with LLVM support and the shared library is installed), the task
+of JIT compiling an expression gets handed of to the shared library. This
+obviously requires that the function in jit.c is allowed to fail in case
+no JIT provider can be loaded.
 
 Which shared library is loaded is determined by the jit_provider GUC,
 defaulting to "llvmjit".
 
 Cloistering code performing JIT into a shared library unfortunately
 also means that code doing JIT compilation for various parts of code
-has to be located separately from the code doing so without
+has to be located separately from the code that executes without
 JIT. E.g. the JITed version of execExprInterp.c is located in
 jit/llvm/ rather than executor/.
 
@@ -105,17 +105,21 @@ implementations.
 
 Emitting individual functions separately is more expensive than
 emitting several functions at once, and emitting them together can
-provide additional optimization opportunities. To facilitate that the
-LLVM provider separates function definition from emitting them in an
-executable way.
+provide additional optimization opportunities. To facilitate that, the
+LLVM provider separates function definition (LLVM IR) from function
+emission (executable mmap()ed segments).
 
 Creating functions into the current mutable module (a module
 essentially is LLVM's equivalent of a translation unit in C) is done
 using
+
   extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
+
 in which it then can emit as much code using the LLVM APIs as it
 wants. Whenever a function actually needs to be called
+
   extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
+
 returns a pointer to it.
 
 E.g. in the expression evaluation case this set

Re: Parallel Aggregates for string_agg and array_agg

2018-03-29 Thread Tomas Vondra


On 03/29/2018 05:49 AM, David Rowley wrote:
> On 29 March 2018 at 03:05, Tomas Vondra  wrote:
>> On 03/28/2018 03:54 PM, Tom Lane wrote:
>>> I had in mind to look at exprType() of the argument.
>>
>> Right, I'm fine with that.
> 
> Attached is v9, which is based on Tom's v8 but includes the new tests
> and I think the required fix to disable use of the serial/deserial
> function for array_agg().
> 

I have only looked at the diff, but it seems fine to me (in the sense
that it's doing the right checks to disable parallelism only when really
needed).

FWIW I wonder if we might simply fallback to input/output functions when
the send/receive functions are not available. Not sure it's worth it.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Alvaro Herrera
Moving thread to pgsql-hackers.

Simon Riggs wrote:
> On 28 March 2018 at 22:23, Andres Freund  wrote:
> 
> > Add documentation for the JIT feature.
> 
> Very nice feature and most welcome but we should call it something
> other than just "JIT"
> 
> JIT means Just In Time, which could be applied to many concepts and
> has been in use for many years in a range of concepts. particularly in
> manufacturing/logistics and project management.

I agree.  In some email threads Andres has been using "JIT" as a verb,
too, such as "JITing expressions" and such; that's a bit shocking, in a
way.  Honestly I don't care in a pgsql-hackers thread, I mean we all
understand what it means, but in user-facing docs and things we should
use complete words, "JIT-compile", "JIT-compilation", "JIT-compiling"
and so on.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-29 Thread Jeevan Chalke
On Wed, Mar 28, 2018 at 7:21 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Tue, Mar 27, 2018 at 2:43 PM, Jeevan Chalke
>  wrote:
>
> > I am not sure what we should Assert here. Note that we end up here
> only
> > when doing grouping, and thus I don't think we need any Assert here.
> > Let me know if I missed anything.
>
> Since we are just checking whether it's an upper relation and directly
> using root->upper_targets[UPPERREL_GROUP_AGG], I thought we could add
> an assert to verify that it's really the grouping rel we are dealing
> with. But I guess, we can't really check that from given relation. But
> then for a grouped rel we can get its target from RelOptInfo. So, we
> shouldn't need to use root->upper_targets[UPPERREL_GROUP_AGG]. Am I
> missing something? For other upper relations we do not set the target
> yet but then we could assert that there exists one in the grouped
> relation.
>

Yes. We fetch the target from the grouped_rel itself.
Added Assert() per our off-list discussion.


>
> >
> >>
> >>
> >> -get_agg_clause_costs(root, (Node *)
> >> root->parse->havingQual,
> >> +get_agg_clause_costs(root, fpinfo->havingQual,
> >>   AGGSPLIT_SIMPLE, &aggcosts);
> >>  }
> >> Should we pass agg costs as well through GroupPathExtraData to avoid
> >> calculating it again in this function?
> >
> >
> > Adding an extra member in GroupPathExtraData just for FDW does not look
> good
> > to me.
> > But yes, if we do that, then we can save this calculation.
> > Let me know if its OK to have an extra member for just FDW use, will
> prepare
> > a separate patch for that.
>
> I think that should be fine. A separate patch would be good, so that a
> committer can decide whether or not to include it.
>

Attached patch 0003 for this.


>
>
> >>
> >> +Node   *havingQual;
> >> I am wondering whether we could use remote_conds member for storing
> this.
> >
> >
> > This havingQual is later checked for shippability and classified into
> > pushable and non-pushable quals and stored in remote_conds and
> local_conds
> > respectively.
> > Storing it directly in remote_conds and then splitting it does not look
> good
> > to me.
> > Also, remote_conds is list of RestrictInfo nodes whereas havingQual is
> not.
> > So using that for storing havingQual does not make sense. So better to
> have
> > a separate member in PgFdwRelationInfo.
>
> Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
> are basically the conditions on the relation being pushed down.
> havingQuals are conditions on a grouped relation so treating them like
> baserestrictinfo or join conditions looks more straight forward,
> rather than having a separate member in PgFdwRelationInfo. So, remote
> havingQuals go into remote_conds and local havingQuals go to
> local_conds.
>

OK. Agreed.
In this version, I have not added anything to PgFdwRelationInfo.
The HAVING qual is needed in two places: (1) in foreign_grouping_ok(), to
check shippability, so the translated HAVING qual is passed to it as a
parameter; and (2) when estimating aggregate costs in
estimate_path_cost_size(), where we can use havingQual from root itself
since the costs won't change between parent and child.
Thus there is no need to store havingQual in PgFdwRelationInfo.

Thanks


--
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 761d46b1b255a7521529bbe7d1b128c9d79d6b4e Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Tue, 27 Mar 2018 14:23:00 +0530
Subject: [PATCH 1/3] Remove target from GroupPathExtraData, instead fetch it
 from grouped_rel.

---
 src/backend/optimizer/plan/planner.c | 4 +---
 src/include/nodes/relation.h | 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a19f5d0..d4acde6 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3734,7 +3734,6 @@ create_grouping_paths(PlannerInfo *root,
 			flags |= GROUPING_CAN_PARTIAL_AGG;
 
 		extra.flags = flags;
-		extra.target = target;
 		extra.target_parallel_safe = target_parallel_safe;
 		extra.havingQual = parse->havingQual;
 		extra.targetList = parse->targetList;
@@ -6909,7 +6908,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	int			cnt_parts;
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
-	PathTarget *target = extra->target;
+	PathTarget *target = grouped_rel->reltarget;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
@@ -6943,7 +6942,6 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 			adjust_appendrel_attrs(root,
    (Node *) target->exprs,
    nappinfos, appinfos);
-		child_extra.tar

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-29 Thread Jeevan Chalke
On Thu, Mar 29, 2018 at 4:13 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Wed, Mar 28, 2018 at 7:21 PM, Ashutosh Bapat
>  wrote:
> >
> > Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
> > are basically the conditions on the relation being pushed down.
> > havingQuals are conditions on a grouped relation so treating them like
> > baserestrictinfo or join conditions looks more straight forward,
> > rather than having a separate member in PgFdwRelationInfo. So, remote
> > havingQuals go into remote_conds and local havingQuals go to
> > local_conds.
>
> Looks like we already do that. Then we have remote_conds, local_conds
> which together should be equivalent to havingQual. Storing all those
> three doesn't make sense. In future someone may use havingQual instead
> of remote_conds/local_conds just because its available and then there
> is risk of these three lists going out of sync.
>

Yep, I see the risk.


>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] [PATCH] Incremental sort

2018-03-29 Thread Alexander Korotkov
On Wed, Mar 28, 2018 at 7:17 PM, Andres Freund  wrote:

> On 2018-03-28 16:28:01 +0300, Teodor Sigaev wrote:
> > > BTW, patch had conflicts with master.  Please, find rebased version
> attached.
> >
> > Despite the patch conflict, the patch looks committable; has anybody objections
> to
> > commit it?
>
> > Patch received several rounds of review during 2 years, and it seems to me
> > that keeping it out of the sources may cause us to lose it. Although it
> > suggests performance improvement in rather wide use cases.
>
> My impression it has *NOT* received enough review to be RFC. Not saying
> it's impossible to get there this release, but that just committing it
> doesn't seem wise.
>

I would say that the executor part of this patch has already received plenty
of review.  For sure, there still might be issues.  I just mean that the
amount of review of the executor part is not less than for an average patch
we commit.  But the optimizer and costing parts of this patch still need
somebody to take a look at them.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Pluggable storage

2018-03-29 Thread David Steele
On 3/28/18 8:39 PM, Michael Paquier wrote:
> On Wed, Mar 28, 2018 at 12:23:56PM -0400, David Steele wrote:
>> I think this entry should be moved the the next CF.  I'll do that
>> tomorrow unless there are objections.
> 
> Instead of moving things to the next CF by default, perhaps it would
> make more sense to mark things as returned with feedback as this is the
> last CF?  There is a 5-month gap between this commit fest and the next
> one, I am getting afraid of flooding the beginning of v12 development
> cycle with entries which keep rotting over time.  If the author(s) claim
> that they will be able to work on it, then that's of course fine.

I agree and I do my best to return patches that have stalled, but I
don't think this patch is in that category.  It has gotten review and
has been kept up to date.  I don't think it's a good fit for v11 but I'd
like to see it in the first CF for v12.

> Sorry for the digression, patches ignored across CFs contribute to the
> bloat we see, and those eat the time of the CFM.

There's no question that bloat has become a problem.  I don't have all
the answers, but vigilance by the CFMs is certainly a good start.

Regards,
-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: Parallel Aggregates for string_agg and array_agg

2018-03-29 Thread David Rowley
On 30 March 2018 at 02:00, Tomas Vondra  wrote:
> On 03/29/2018 05:49 AM, David Rowley wrote:
>> Attached is v9, which is based on Tom's v8 but includes the new tests
>> and I think the required fix to disable use of the serial/deserial
>> function for array_agg().
>>
>
> I have only looked at the diff, but it seems fine to me (in the sense
> that it's doing the right checks to disable parallelism only when really
> needed).
>
> FWIW I wonder if we might simply fallback to input/output functions when
> the send/receive functions are not available. Not sure it's worth it.

I think it's a corner case to have a type without send/receive, but I
might just be lacking imagination.

I meant to mention earlier that I coded
agg_args_have_sendreceive_funcs() to only check for send/receive
functions. Really we could allow byval types without send/receive
functions, since the serial/deserial just send the raw datums in that
case, but then the function becomes
agg_byref_args_have_sendreceive_funcs(), which seemed a bit obscure,
so I didn't do that.  Maybe I should?
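
As a quick way to see which types that corner case covers, a rough catalog
query along these lines (just a sketch, not part of the patch) lists base
types lacking binary I/O functions:

  SELECT typname, typbyval
  FROM pg_type
  WHERE typtype = 'b'
    AND (typsend::oid = 0 OR typreceive::oid = 0);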


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-29 Thread Alexander Korotkov
On Thu, Mar 29, 2018 at 1:38 PM, Teodor Sigaev  wrote:

> The next question that I see: why do we not lock entry leaf pages in some
>> cases?
>>
>
> I've modified the patch to predicate lock each leaf (entry or posting) page.
> Now the patch reaches a committable state from my point of view.


I made some enhancements in comments and readme.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Predicate-Locking-in-gin-index_v11.patch
Description: Binary data


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-29 Thread Craig Ringer
On 29 March 2018 at 20:07, Thomas Munro 
wrote:

> On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer 
> wrote:
> > On 28 March 2018 at 11:53, Tom Lane  wrote:
> >>
> >> Craig Ringer  writes:
> >> > TL;DR: Pg should PANIC on fsync() EIO return.
> >>
> >> Surely you jest.
> >
> > No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC
> as
> > well to avoid similar lost-page-write issues.
>
> I found your discussion with kernel hacker Jeff Layton at
> https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
> writeup seems to want a scheme where pages stay dirty after a
> writeback failure so that we can try to fsync them again. Note that
> that has never been the case in Linux after hard writeback failures,
> AFAIK, so programs should definitely not assume that behavior."
>
> The article above that says the same thing a couple of different ways,
> ie that writeback failure leaves you with pages that are neither
> written to disk successfully nor marked dirty.
>
> If I'm reading various articles correctly, the situation was even
> worse before his errseq_t stuff landed.  That fixed cases of
> completely unreported writeback failures due to sharing of PG_error
> for both writeback and read errors with certain filesystems, but it
> doesn't address the clean pages problem.
>
> Yeah, I see why you want to PANIC.
>

In more ways than one ;)

> I'm not seeking to defend what the kernel seems to be doing. Rather,
> saying
> > that we might see similar behaviour on other platforms, crazy or not. I
> > haven't looked past linux yet, though.
>
> I see no reason to think that any other operating system would behave
> that way without strong evidence...  This is openly acknowledged to be
> "a mess" and "a surprise" in the Filesystem Summit article.  I am not
> really qualified to comment, but from a cursory glance at FreeBSD's
> vfs_bio.c I think it's doing what you'd hope for... see the code near
> the comment "Failed write, redirty."


Ok, that's reassuring, but doesn't help us on the platform the great
majority of users deploy on :(

"If on Linux, PANIC"

Hrm.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel safety of binary_upgrade_create_empty_extension

2018-03-29 Thread Tom Lane
Thomas Munro  writes:
> Presumably also cursor_to_xmlschema.  I also found some more
> suspicious brin and gin modifying functions.  So I think the list of
> functions that needs to be marked 'u' so far is:

> * binary_upgrade_create_empty_extension
> * pg_import_system_collations
> * brin_summarize_range
> * brin_summarize_new_values
> * brin_desummarize_range
> * gin_clean_pending_list
> * cursor_to_xml
> * cursor_to_xmlschema

> Has anyone got anything else?

There are some tsquery functions that execute user-supplied queries,
and so need to be 'u' on the same grounds as the cursor_to_xml cases.
I haven't checked to be sure if they are already so, but was planning
to do so before considering this matter closed.
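
For reference, the current markings of the functions listed above can be
checked with a simple catalog query (sketch only; assumes a development
build in which all of these functions exist):

  SELECT proname, proparallel
  FROM pg_proc
  WHERE proname IN ('binary_upgrade_create_empty_extension',
                    'pg_import_system_collations',
                    'brin_summarize_range',
                    'brin_summarize_new_values',
                    'brin_desummarize_range',
                    'gin_clean_pending_list',
                    'cursor_to_xml',
                    'cursor_to_xmlschema')
  ORDER BY proname;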

regards, tom lane



Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread John Naylor
> I agree.  In some email threads Andres has been using "JIT" as a verb,
> too, such as "JITing expressions" and such; that's a bit shocking, in a
> way.  Honestly I don't care in a pgsql-hackers thread, I mean we all
> understand what it means, but in user-facing docs and things we should
> use complete words, "JIT-compile", "JIT-compilation", "JIT-compiling"
> and so on.

Earlier today, I did some web searches to determine how people spell
"JITed" (Andres' spelling), and also found JITted, JIT-ed, JIT'd, and
jitted. No one agrees on that, but it seems very common to use "JIT"
as a verb. See the LLVM docs:

https://llvm.org/docs/DebuggingJITedCode.html

-John Naylor



Re: Cast jsonb to numeric, int, float, bool

2018-03-29 Thread Teodor Sigaev

Thanks to everyone; pushed with some editorialization.

Teodor Sigaev wrote:
I think it should support casts from/to numeric/bool/text only. If we want to
have casts from numeric to other numeric types then it should be the full set
(int2, int4, int8, float4, float8).


I was too optimistic about casting to/from text. It already exists and it works
differently from the suggested casts:


1) select '"foo"'::jsonb::text;  -> "foo"  // with ""
2) select '123'::jsonb::text;    -> 123
3) select '123'::jsonb::int4;    -> 123

It seems that sometimes the desirable result in 1) would be just
'foo' without the "" decoration, but we cannot have 2 different explicit casts
for the same types. So, I withdraw my objections about text casting, but I
still believe that we need one of two variants:

1) numeric/bool
2) bool, numeric/all variants of numeric types
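
For readers following along, the casts as pushed behave roughly like this
(a sketch; exact coverage and error wording may differ from this
illustration):

  SELECT '1.5'::jsonb::numeric;    -- 1.5
  SELECT '12345'::jsonb::int4;     -- 12345
  SELECT 'true'::jsonb::boolean;   -- t
  SELECT '"foo"'::jsonb::int4;     -- fails: cannot cast a jsonb string to integer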




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-03-29 Thread Tomas Vondra
On 03/29/2018 02:27 AM, Dean Rasheed wrote:
> On 28 March 2018 at 15:50, Tomas Vondra  wrote:
>> After thinking about this a bit more, I'm not sure if updating the info
>> based on recursive calls makes sense. The fullmatch flag was supposed to
>> answer a simple question - can there be just a single matching item?
>>
>> If there are equality conditions on all columns, there can be just a
>> single matching item - if we have found it in the MCV (i.e. s1 > 0.0),
>> then we don't need to inspect the non-MCV part.
>>
>> But handling this in recursive manner breaks this entirely, because with
>> something like
>>
>>(a=1) AND (b=1 OR b=2)
>>
>> you suddenly can have multiple matching items. Which makes the fullmatch
>> flag somewhat useless.
>>
>> So I think we should be looking at top-level equality clauses only, just
>> like number_of_groups() does.
>>
> 
> I'm not quite sure what you mean by that, but it sounds a bit limiting
> in terms of the kinds of user queries that would be supported.
> 

Let me explain. The question is "Can there be just a single combination
of values matching the conditions?" which (if true) allows us to produce
better estimates. If we found a match in the MCV, we don't need to look
at the non-MCV part. If not found in the MCV, we can compute an average
selectivity as 1/ndistinct (possibly using the ndistinct coefficients).

If we can't deduce the existence of a single possible match, we have to
compute an estimate in a more generic way.

With (a=1 AND b=1) and stats on (a,b) there's just a single possible
match (1,1), so that's fine. But it does not work once we start looking
for equalities nested deeper - for example (a=1 AND (b=1 OR b=2)) can be
translated as ((a=1 AND b=1) OR (a=1 AND b=2)) so technically there's an
equality on each column, but there are two possible matches (1,1) and
(1,2). So the optimization does not work.

Does that clarify what I meant?
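
To make that concrete, a small sketch (using the mcv statistics kind this
patch introduces; syntax as in the patch, so subject to change):

  CREATE TABLE t (a int, b int);
  CREATE STATISTICS s (mcv) ON a, b FROM t;
  ANALYZE t;

  -- a single possible combination (1,1): if it's in the MCV, we're done
  SELECT * FROM t WHERE a = 1 AND b = 1;

  -- two possible combinations (1,1) and (1,2): the shortcut no longer applies
  SELECT * FROM t WHERE a = 1 AND (b = 1 OR b = 2);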

Although, perhaps we could improve this by deducing number of possible
matches and then track matching items in the MCV list. But that seems
quite a bit harder.

(Of course, we need to consider the non-equality clauses in both cases,
the WIP patch does not do that yet.)

> 
>> I think we can remove the fullmatch flag from mcv_update_bitmap
>> entirely. All we need to know is the presence of equality clauses and
>> whether there was a match in MCV (which we know from s1 > 0.0).
>>
> 
> I agree with removing the fullmatch flag, but I don't think we
> actually need to know about the presence of equality clauses:
> 
> The way that mcv_update_bitmap() recursively computes the set of
> matching MCVs seems to be correct. That gives us a value (call it
> mcv_matchsel) for the proportion of the table's rows that are in the
> MCV list and satisfy the clauses in stat_clauses.
> 

Sure, but the extra bit of information allows us to (a) ignore the
non-MCV part and (b) apply the 1/ndistinct estimate.

> We can also estimate that there are (1-mcv_totalsel)*N rows that are
> not in the MCV list, for which the MCV stats therefore tell us
> nothing. The best way to estimate those rows would seem to be to use
> the logic from the guts of clauselist_selectivity(), without
> consulting any extended MCV stats (but still consulting other extended
> stats, I think). Doing that would return a selectivity value (call it
> nonmcv_sel) for those remaining rows. Then a reasonable estimate for
> the overall selectivity would seem to be
> 
>   mcv_matchsel + (1-mcv_totalsel) * nonmcv_sel
> 
> and there would be no need for mcv_update_bitmap() to track eqmatches
> or return fullmatch, and it wouldn't actually matter whether or not we
> had equality clauses or if all the MCV columns were used.
> 

Right, although I'm not sure about fallback to clauselist_selectivity()
which kinda throws away the statistical dependency.

That's why I think we should use 1/ndistinct for equality clauses, and
then perhaps leverage the MCV for non-equality clauses somehow.

It just occurred to me that we can apply the 1/ndistinct estimate for
equalities even when it's not a 'fullmatch'.

So what I propose is roughly this

1) compute selectivity "mcv_sel" using MCV

2) see if there can be just a single match, and (mcv_sel > 0) - if yes,
we're done and we don't need to look at non-MCV part

3) split the clauses into top-level equality clauses and the rest

4) estimate "equal_sel" for equality clauses using 1/ndistinct

5) estimate the "inequal_sel" for remaining clauses using MCV (assumes
the selectivity will be the same on non-MCV part)

6) total selectivity is

mcv_sel + (1 - mcv_totalsel) * equal_sel * inequal_sel


We may need to fall back to clauselist_selectivity() in some cases, of
course, but I think we should leverage the MCV as much as possible.
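
As a rough worked example of that formula (numbers invented purely for
illustration): with mcv_totalsel = 0.8, mcv_sel = 0.05, ndistinct(a,b) = 100
(so equal_sel = 0.01) and inequal_sel = 0.5, we'd get

  0.05 + (1 - 0.8) * 0.01 * 0.5 = 0.05 + 0.001 = 0.051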

Another thing is that some of this will change once the histograms are
considered, which helps with estimating the non-MCV part.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x

Re: Function to track shmem reinit time

2018-03-29 Thread Tomas Vondra


On 03/28/2018 08:55 PM, David Steele wrote:
> On 3/4/18 11:17 AM, Tomas Vondra wrote:
>>
>> Furthermore, the patch is yet another victim of fd1a421fe - fixing the
>> pg_proc entries is trivial, but a new version is needed.
>>
>> I'd also like to see an example/explanation how this improves this
>> situation compared to only having pg_postmaster_start_time.
>>
>> So I'm setting this as waiting on author for now.
> 
> I'm not sure why this was set back to Needs Review since it neither
> applies cleanly nor builds.
> 

It likely applied/built fine when it was set to NR, but got broken by
some recent commit.

> I'm setting this entry to Waiting on Author, but based on the discussion
> I think it should be Returned with Feedback.
> 

Fine with me.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Alvaro Herrera
John Naylor wrote:
> > I agree.  In some email threads Andres has been using "JIT" as a verb,
> > too, such as "JITing expressions" and such; that's a bit shocking, in a
> > way.  Honestly I don't care in a pgsql-hackers thread, I mean we all
> > understand what it means, but in user-facing docs and things we should
> > use complete words, "JIT-compile", "JIT-compilation", "JIT-compiling"
> > and so on.
> 
> Earlier today, I did some web searches to determine how people spell
> "JITed" (Andres' spelling), and also found JITted, JIT-ed, JIT'd, and
> jitted. No one agrees on that, but it seems very common to use "JIT"
> as a verb.

Yes: among compiler writers, people who are swimming in jitted bytes all
day long -- a tough bunch if I've seen any.  (Not that us here are
sparrows, mind.)

I meant that our docs are for normal people, not *them*.

(I too was thinking about the double 't' there while drafting the above
but decided to leave that concern out.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Typo in pg_backup_custom.c

2018-03-29 Thread Daniel Gustafsson
Attached fixes a typo, which according to Git is close to being allowed to
drive.

cheers ./daniel



typo_pgbackupcustom.patch
Description: Binary data


Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Tom Lane
Alvaro Herrera  writes:
> Simon Riggs wrote:
>> JIT means Just In Time, which could be applied to many concepts and
>> has been in use for many years in a range of concepts. particularly in
>> manufacturing/logistics and project management.

> I agree.  In some email threads Andres has been using "JIT" as a verb,
> too, such as "JITing expressions" and such; that's a bit shocking, in a
> way.  Honestly I don't care in a pgsql-hackers thread, I mean we all
> understand what it means, but in user-facing docs and things we should
> use complete words, "JIT-compile", "JIT-compilation", "JIT-compiling"
> and so on.

I'd go a little further and drop "JIT" from user-facing documentation
altogether.  Instead refer to the feature as "compilation of expressions"
or some such.  JIT is just jargon.  Plus, the timing of the compilation is
actually the least important property for our purpose.

regards, tom lane



Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Stephen Frost
Greetings Tom, all,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Alvaro Herrera  writes:
> > Simon Riggs wrote:
> >> JIT means Just In Time, which could be applied to many concepts and
> >> has been in use for many years in a range of concepts. particularly in
> >> manufacturing/logistics and project management.
> 
> > I agree.  In some email threads Andres has been using "JIT" as a verb,
> > too, such as "JITing expressions" and such; that's a bit shocking, in a
> > way.  Honestly I don't care in a pgsql-hackers thread, I mean we all
> > understand what it means, but in user-facing docs and things we should
> > use complete words, "JIT-compile", "JIT-compilation", "JIT-compiling"
> > and so on.
> 
> I'd go a little further and drop "JIT" from user-facing documentation
> altogether.  Instead refer to the feature as "compilation of expressions"
> or some such.  JIT is just jargon.  Plus, the timing of the compilation is
> actually the least important property for our purpose.

Agreed.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: new buildfarm with gcc & clang "trunk" -Werror?

2018-03-29 Thread Fabien COELHO


Hello Peter,


  - fix these warnings (other than putting -Wno-format-truncation or
similar workarounds...).


I've been tracking gcc-8, and I have a patch ready, but these things
change a bit with every compiler snapshot, so I'm waiting until things
settle down.


Ok, so I will not submit anything for gcc. I'll look at clang, though.


  - add permanent gcc/clang trunk beasts with -Werror
(if so, I'd suggest cannonball & seanettle for the names).


I would not like that.  The build farm success should ideally be a
function of correct PostgreSQL code, not external factors.  If an
in-progress compiler does funny things, what are we supposed to do about
that?


That happens infrequently, commits in gcc & clang are supposed not to 
break the compiler, just like with postgres:-) When it happens (3 times in 
6 months), I just report a bug on the compiler side, which is usually 
fixed in under one week.



Generally, fixing up PostgreSQL for a new compiler release isn't a major
effort and can be done briefly before the release of the compiler.


Sure.

So I understand that a -Werror beast is somehow not desirable.

--
Fabien.



Re: Parallel Aggregates for string_agg and array_agg

2018-03-29 Thread Tomas Vondra


On 03/29/2018 03:09 PM, David Rowley wrote:
> On 30 March 2018 at 02:00, Tomas Vondra  wrote:
>> On 03/29/2018 05:49 AM, David Rowley wrote:
>>> Attached is v9, which is based on Tom's v8 but includes the new tests
>>> and I think the required fix to disable use of the serial/deserial
>>> function for array_agg().
>>>
>>
>> I have only looked at the diff, but it seems fine to me (in the sense
>> that it's doing the right checks to disable parallelism only when really
>> needed).
>>
>> FWIW I wonder if we might simply fallback to input/output functions when
>> the send/receive functions are not available. Not sure it's worth it.
> 
> I think it's a corner case to have a type without send/receive, but I
> might just be lacking imagination.
> 

Not sure, but it's a valid case. OTOH we probably don't need to obsess
about handling it perfectly.

> I meant to mention earlier that I coded
> agg_args_have_sendreceive_funcs() to only check for send/receive
> functions. Really we could allow byval types without send/receive
> functions, since the serial/deserial just send the raw datums in that
> case, but then the function becomes
> agg_byref_args_have_sendreceive_funcs(), which seemed a bit obscure,
> so I didn't do that.  Maybe I should?
> 

I'd do that. Not sure the function name needs to change, but perhaps
agg_args_support_sendreceive() would be better - it covers both byref
types (which require send/receive functions) and byval (which don't).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: committing inside cursor loop

2018-03-29 Thread Peter Eisentraut
On 3/29/18 07:37, Andrew Gierth wrote:
>> "Peter" == Peter Eisentraut  writes:
> 
>  Peter> Committed, thanks!
> 
> So... what is a pl/* that does _not_ use pinned cursors for cursor loops
> supposed to do in this case?

Other than maybe using pinned cursors, one would have to look at the
whole picture and see what makes sense.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: new buildfarm with gcc & clang "trunk" -Werror?

2018-03-29 Thread Tom Lane
Fabien COELHO  writes:
> So I understand that a -Werror beast is somehow not desirable.

Not at this point; I for one would simply ignore it.  *After* we
have decided what project policy for this class of warnings is
going to be, it might be valuable to have such a critter.  But
that's not decided yet, and I'm not at all sure that "fix every
such warning" is going to be the decision.  We might well decide
these are mostly nannyism and add a -Wno-xxx flag instead.

In the meantime, I think we already have an animal running git trunk
without -Werror; if not, feel free to create one.

regards, tom lane



Re: 2018-03 Commitfest Summary (Andres #1)

2018-03-29 Thread Bruce Momjian
On Fri, Mar  2, 2018 at 12:13:26PM -0800, Peter Geoghegan wrote:
> FWIW, I think that pgbench would become a lot more usable if someone
> maintained a toolset for managing pgbench. Something similar to Greg
> Smith's pgbench-tools project, but with additional features for
> instrumenting the server. There would be a lot of value in integrating
> it with third party tooling, such as perf and BCC, and in making it
> easy for non-experts to run relevant, representative tests.
> 
> Things like the rate limiting and alternative distributions were
> sorely needed, but there are diminishing returns. It's pretty clear to
> me that much of the remaining low hanging fruit is outside of pgbench
> itself. None of the more recent pgbench enhancements seem to make it
> easier to use.

Has anyone considered moving pgbench out of our git tree and into a
separate project where a separate team could maintain and improve it?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: 2018-03 Commitfest Summary (Andres #1)

2018-03-29 Thread Pavel Stehule
2018-03-29 16:07 GMT+02:00 Bruce Momjian :

> On Fri, Mar  2, 2018 at 12:13:26PM -0800, Peter Geoghegan wrote:
> > FWIW, I think that pgbench would become a lot more usable if someone
> > maintained a toolset for managing pgbench. Something similar to Greg
> > Smith's pgbench-tools project, but with additional features for
> > instrumenting the server. There would be a lot of value in integrating
> > it with third party tooling, such as perf and BCC, and in making it
> > easy for non-experts to run relevant, representative tests.
> >
> > Things like the rate limiting and alternative distributions were
> > sorely needed, but there are diminishing returns. It's pretty clear to
> > me that much of the remaining low hanging fruit is outside of pgbench
> > itself. None of the more recent pgbench enhancements seem to make it
> > easier to use.
>
> Has anyone considered moving pgbench out of our git tree and into a
> separate project where a separate team could maintain and improve it?
>

It shares code with psql.

Regards

Pavel


>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>
>


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-03-29 Thread Alvaro Herrera
Rushabh Lathia wrote:

> CREATE TABLE foo (a INT, b INT, c VARCHAR) PARTITION BY LIST(a);
> CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES IN (1,2);
> CREATE TABLE foo_p2 PARTITION OF foo FOR VALUES IN (3,4);
> INSERT INTO foo select i,i,i from generate_series(1,4)i;
> 
> CREATE TABLE foo_d (like foo);
> INSERT INTO foo_d select i,i,i from generate_series(1,9)i;
> 
> ALTER TABLE foo ATTACH PARTITION foo_d DEFAULT;
> 
> Above ATTACH PARTITION should fail with "partition constraint is violated"
> error, but instead it's working on a master branch.

Hmm, offhand I don't quite see why this error fails to be thrown.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: csv format for psql

2018-03-29 Thread Daniel Verite
David G. Johnston wrote:

> Unaligned format could grow its own fieldsep if it wanted to but it can
> also just use the default provided fieldsep var whose default value is
> pipe.  If it did grow one it would need to understand "not set" in order to
> preserve existing behavior.  We don't have that problem with csv.

fieldsep is already owned by the unaligned format. No other format
uses fieldsep currently. Why would it need to grow its own?
I don't quite see what you mean here.

For csv, Fabien and Peter expressed the opinion that we shouldn't
create another fieldsep-like variable specifically for it, but instead
reuse fieldsep. That's what my latest patch does.

Now it turns out that sharing fieldsep comes with some problems.

1. since a single variable can't have two different default values,
we have either to accept '|' as a default csv separator, or introduce
some tricks to automagically commute the value.

2. when a user does \pset fieldsep ';' there are doubts
on whether this should affect the current mode only, or both,
or even the other one when the user goes back and forth between
formats. Any proposal where a \pset var value affects another pset
variable as a side-effect is inconsistent with what has been
done until now with these variables, and I expect that
side-effects are not exactly sought after in general.

3. the command-line cannot express: use character 'X' for unaligned
and 'Y' for csv for the rest of the session. The current patch provides
'--csv' to select both the csv format and a comma separator, but
when combined with -F, the end result depends on the relative position
of the options, which may feel unintuitive or buggy.
Again we could tweak that, but only by being inconsistent
with how we process other options.

Personally I think the benefit of sharing fieldsep is not worth these
problems, but I'm waiting for the discussion to continue with
more opinions.
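
To make problem 2 concrete, this is the kind of session where the right
answer isn't obvious with a shared variable (sketch only; which behaviour
is correct is exactly what's being debated here):

  \pset format csv
  \pset fieldsep ';'       -- intended for csv output...
  \pset format unaligned   -- ...but should unaligned now use ';' too, or stay with '|'?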


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: committing inside cursor loop

2018-03-29 Thread Andrew Gierth
> "Peter" == Peter Eisentraut  writes:

 >> So... what is a pl/* that does _not_ use pinned cursors for cursor
 >> loops supposed to do in this case?

 Peter> Other than maybe using pinned cursors, one would have to look at
 Peter> the whole picture and see what makes sense.

Never mind, I was overlooking the obvious: explicitly specifying
CURSOR_OPT_HOLD is sufficient for me. (Can't really use pinned cursors
because they might not be unpinned fast enough due to timing of garbage
collection, which would lead to spurious errors.)

-- 
Andrew (irc:RhodiumToad)



Re: Allow workers to override datallowconn

2018-03-29 Thread Tomas Vondra


On 03/02/2018 12:16 PM, Magnus Hagander wrote:
> On Fri, Feb 23, 2018 at 7:55 PM, Magnus Hagander  > wrote:
> 
> On Fri, Feb 23, 2018 at 7:52 PM, Tom Lane  > wrote:
> 
> Magnus Hagander  > writes:
> > Here's another attempt at moving this one forward. Basically this 
> adds a
> > new GucSource being GUC_S_CLIENT_EARLY. It now runs through the 
> parameters
> > once before CheckMyDatabase, with source set to GUC_S_CLIENT_EARLY. 
> In this
> > source, *only* parameters that are flagged as GUC_ALLOW_EARLY will 
> be set,
> > any other parameters are ignored (without error). For now, only the
> > ignore_connection_restriction is allowed at this stage. Then it runs
> > CheckMyDatabase(), and after that it runs through all the 
> parameters again,
> > now with the GUC_S_CLIENT source as usual, which will now process 
> all
> > other  variables.
> 
> Ick.  This is an improvement over the other way of attacking the
> problem?
> I do not think so.
> 
> 
> Nope, I'm far from sure that it is. I just wanted to show what it'd
> look like.
> 
> I personally think the second patch (the one adding a parameter to
> BackendWorkerInitializeConnection) is the cleanest one. It doesn't
> solve Andres' problem, but perhaps that should be the job of a
> different patch.
> 
> 
> 
> FWIW, I just realized that the poc patch that adds the GUC also breaks
> a large part of the regression tests. As a side-effect of it breaking
> how DateStyle works. That's obviously a fixable problem, but it seems
> not worth spending time on if that's not the way forward anyway.
> 
> Andres, do you have any other ideas of directions to look that would
> make you withdraw your objection? I'm happy to try to write up a patch
> that solves it in a way that everybody can agree with. But right now it
> seems to be stuck between one way that's strongly objected to by you and
> one way that's strongly objected to by Tom. And I'd rather not have that
> end up with not getting the problem solved at all for *any* of the
> usecases...
> 

My 2c: I think we should just go with the simplest solution, that is the
patch sent on 22/2 19:54 (i.e. BackgroundWorkerInitializeConnectionByOid
with an extra parameter).

It would be nice to have something more generic that also works for the
Andres' use case, but the patches submitted to this thread are not
particularly pretty. Also, Tom suggested there may be safety issues when
the GUCs are processed earlier - I agree Andres is right the GUCs are
effectively ASCII-only so the encoding is not an issue, but perhaps
there are other issues (Tom suggested this merely as an example).

So I agree with Magnus, the extra flag seems to be perfectly fine for
bgworkers, and I'd leave the more generic solution for a future patch if
anyone wants to hack on it.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-03-29 Thread Daniel Gustafsson
> On 27 Mar 2018, at 13:50, Eren Başak  wrote:

> > pg_cancel_backend() is defined proisstrict, while pg_cancel_backend_msg() is
> > not.  I think we must be able to perform the cancellation if the message is
> > NULL, so made it two functions.
> 
> I agree that we should preserve the old usage as well. What I don't 
> understand is that, can't we remove proisstrict from pg_cancel_backend and 
> copy the body of pg_cancel_backend_msg into pg_cancel_backend. This way, I 
> think we can accomplish the same thing without introducing two new functions.

If we do that, wouldn’t SELECT pg_cancel_backend(NULL) fail unless NULL is
explicitly casted to integer?  Also, I think that would cause make check to
fail the opr_sanity suite on the “multiple uses of the same internal function”
test.  Maybe I’m misunderstanding your point here?

> pg_terminate_backend_msg errors out if the pid is null but 
> pg_cancel_backend_msg returns null and I think we should have consistency 
> between these two in this regard.

Absolutely, that was an oversight in the v8 patch, they should both handle NULL
in the same way.

Whitespace and null handling fixed, as well as rebased over HEAD.
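
For reference, the intended usage with the patch applied looks roughly like
this (function names per the patch under discussion, the pid is a
placeholder, and details may still change before commit):

  -- terminate a backend and pass along a message to be reported to its client
  SELECT pg_terminate_backend_msg(12345, 'terminating ahead of maintenance window');

  -- existing single-argument behaviour is preserved
  SELECT pg_cancel_backend(12345);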

cheers ./daniel



0001-Refactor-backend-signalling-code-v9.patch
Description: Binary data


0002-Support-optional-message-in-backend-cancel-terminate-v9.patch
Description: Binary data


Re: PATCH: Configurable file mode mask

2018-03-29 Thread David Steele
On 3/28/18 1:59 AM, Michael Paquier wrote:
> On Tue, Mar 27, 2018 at 04:21:09PM -0400, David Steele wrote:
>> These updates address Michael's latest review and implement group access
>> for pg_basebackup, pg_receivewal, and pg_recvlogical.  A new internal
>> GUC, data_directory_group_access, allows remote processes to determine
>> the correct mode using the existing SHOW protocol command.
> 
> Some nits from patch 1...
> 
> +   ok (check_mode_recursive($node_master->data_dir(), 0700, 0600));
> [...]
> +   ok (check_mode_recursive($node_master->data_dir(), 0700, 0600));
> Incorrect indentations (space after "ok", yes that's a nit...).

Fixed the space, not sure about the indentations?

> And more things about patch 1 which are not nits.
> 
> In pg_backup_tar.c, we still have a 0600 hardcoded.  You should use
> PG_FILE_MODE_DEFAULT there as well.

I'm starting to wonder if these changes in pg_dump make sense.  The
file/tar permissions here do not map directly to anything in the PGDATA
directory (since the dump and restore are logical).  Perhaps we should
be adding a -g option for pg_dump (in a separate patch) if we want this
functionality?

> dsm_impl_posix() can create a new segment with shm_open using as mode
> 0600.  This is present in pg_dynshmem, which would be included in
> backups.  First, it seems to me that this should use
> PG_FILE_MODE_DEFAULT.  Then, with patch 2 it seems to me that if an
> instance is using DSM then there is a risk of breaking a simple backup
> which uses for example "tar" without --exclude filters with group access
> (sometimes scripts are not smart enough to skip the same contents as
> base backups).  So it seems to me that DSM should be also made more
> aware of group access to ease the life of users.

Done in patch 1.  For patch 2, do you see any issues with the shared
memory being group readable from a security perspective?  The user can
read everything on disk so it's hard to see how it's a problem.  Also,
if these files are ending up in people's backups...

> And then for patch 2, a couple of issues spotted..
> 
> +   /* for previous versions set the default group access */
> +   if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_GROUP_ACCESS)
> +   {
> +   DataDirGroupAccess = false;
> +   return false;
> +   }
> Enforcing DataDirGroupAccess to false for servers older than v11 is
> fine, but RetrieveDataDirGroupAccess() should return true.  If I read
> your patch correctly then support for pg_basebackup on older server
> would just fail.

You are correct, fixed.

> +#if !defined(WIN32) && !defined(__CYGWIN__)
> +   if ((stat_buf.st_mode & PG_DIR_MODE_DEFAULT) == PG_DIR_MODE_DEFAULT)
> +   {
> +   umask(PG_MODE_MASK_ALLOW_GROUP);
> +   DataDirGroupAccess = true;
> +   }
> This should use SetConfigOption() or I am missing something?

Done.

> +/*
> + * Does the data directory allow group read access?  The default is false, 
> i.e.
> + * mode 0700, but may be changed in checkDataDir() to true, i.e. mode 0750.
> + */
> +bool DataDirGroupAccess = false;
> 
> Instead of a boolean, I would suggest to use an integer, this is more
> consistent with log_file_mode.  

Well, the goal was to make this implementation independent, but I'm not
against the idea.  Anybody have a preference?

> Actually, about that, should we complain
> if log_file_mode is set to a value incompatible?

I think control of the log file mode should be independent.  I usually
don't store log files in PGDATA at all.  What if we set log_file_mode
based on the -g option passed to initdb?  That will work well for
default installations while providing flexibility to others.

Thanks,
-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: JIT compiling with LLVM v12

2018-03-29 Thread Pierre Ducroquet
On Thursday, March 29, 2018 2:39:17 PM CEST Jesper Pedersen wrote:
> Hi Andres,
> 
> On 03/28/2018 05:27 PM, Andres Freund wrote:
> > On 2018-03-27 10:34:26 -0700, Andres Freund wrote:
> >> On 2018-03-27 10:05:47 -0400, Peter Eisentraut wrote:
> >>> On 3/13/18 19:40, Andres Freund wrote:
>  I've pushed a revised and rebased version of my JIT patchset.
> >>> 
> >>> What is the status of this item as far as the commitfest is concerned?
> >> 
> >> 7/10 committed. Inlining, Explain, Docs remain.
> > 
> > I've pushed these three.
> 
> It seems that clang is being picked up as the main compiler in certain
> situations, ala
> 
> ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
> -Wformat-security -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -g -O0 -fno-omit-frame-pointer
> -I../../../src/include  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o
> auth-scram.o auth-scram.c -MMD -MP -MF .deps/auth-scram.Po
> ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
> -Wformat-security -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -g -O0 -fno-omit-frame-pointer
> -I../../../src/include  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o
> be-secure-openssl.o be-secure-openssl.c -MMD -MP -MF
> .deps/be-secure-openssl.Po
> /usr/lib64/ccache/clang -Wno-ignored-attributes -fno-strict-aliasing
> -fwrapv -O2  -I../../../src/include  -D_GNU_SOURCE
> -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o be-fsstubs.bc
> be-fsstubs.c
> /usr/lib64/ccache/clang -Wno-ignored-attributes -fno-strict-aliasing
> -fwrapv -O2  -I../../../src/include  -D_GNU_SOURCE
> -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o namespace.bc namespace.c
> 
> I would expect LLVM to be isolated to the jit/ hierarchy.

Clang is needed to emit the LLVM bitcode required for inlining; the
"-emit-llvm" flag is used for that. A dual compilation is required for
inlining to work: one compilation with gcc/clang/msvc/… to build the
PostgreSQL binary, and one with clang to generate the .bc files for inlining.
It can be surprising, but there is little way around that (or we accept only
clang to build PostgreSQL, but there would be a riot).




Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread David Steele
On 3/29/18 9:51 AM, Stephen Frost wrote:
> Greetings Tom, all,
> 
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Alvaro Herrera  writes:
>>> Simon Riggs wrote:
 JIT means Just In Time, which could be applied to many concepts and
 has been in use for many years in a range of concepts. particularly in
 manufacturing/logistics and project management.
>>
>>> I agree.  In some email threads Andres has been using "JIT" as a verb,
>>> too, such as "JITing expressions" and such; that's a bit shocking, in a
>>> way.  Honestly I don't care in a pgsql-hackers thread, I mean we all
>>> understand what it means, but in user-facing docs and things we should
>>> use complete words, "JIT-compile", "JIT-compilation", "JIT-compiling"
>>> and so on.
>>
>> I'd go a little further and drop "JIT" from user-facing documentation
>> altogether.  Instead refer to the feature as "compilation of expressions"
>> or some such.  JIT is just jargon.  Plus, the timing of the compilation is
>> actually the least important property for our purpose.
> 
> Agreed.

+1. Or simply "expression compilation".

-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: csv format for psql

2018-03-29 Thread David G. Johnston
On Thu, Mar 29, 2018 at 7:30 AM, Daniel Verite 
wrote:

> David G. Johnston wrote:
>
> > Unaligned format could grow its own fieldsep if it wanted to but it can
> > also just use the default provided fieldsep var whose default value is
> > pipe.  If it did grow one it would need to understand "not set" in order
> to
> > preserve existing behavior.  We don't have that problem with csv.
>
> fielsep is already owned by the unaligned format. No other format
> uses fieldsep currently. Why would it needs to grow its own?
> I don't quite see what you mean here.
>

It's mostly a terminology concern...

>
> For csv, Fabien and Peter expressed the opinion that we shouldn't
> create another fieldsep-like variable specifically for it, but instead
> reuse fieldsep. That's what my latest patch does.
>
> Now it turns out that sharing fieldsep comes with some problems.
> [...]
>
>
> Personally I think the benefit of sharing fieldsep is not worth these
> problems, but I'm waiting for the discussion to continue with
> more opinions.


Basically, because "fieldsep" lacks any kind of suffix it can be considered
"unowned", versus "fieldsep_csv" which would be "owned" by the "csv" format.
Given the problems, I do want an "owned by csv" variable.  An owned version
for unaligned would then be named "fieldsep_unaligned".  But we wouldn't want
to rename fieldsep; we'd simply leave it in place as a generic fieldsep
variable that formats can choose to use if they wish, with the limitations
that come with sharing it with unaligned.  Decent odds that no other format
would want to, but we cannot remove it, so we might as well better describe
its place now that two formats care about field
separation.

fieldsep is "shared" even though at present no one else is sharing it.
fieldsep_csv is not shared.  fieldsep_unaligned doesn't exist because
unaligned uses the shared variable.  No one else uses fieldsep, shared or
owned.

Saying that "fieldsep" is owned by unaligned mode would be technically
correct, though there is no "mark" that makes it readily obvious.

Implementation-wise, "ownership" could mean that changing the delimiter can
only occur while the format type matches.  But we could not do that for
fieldsep since no such ordering is currently enforced.

David J.


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-03-29 Thread Rushabh Lathia
On Thu, Mar 29, 2018 at 7:46 PM, Alvaro Herrera 
wrote:

> Rushabh Lathia wrote:
>
> > CREATE TABLE foo (a INT, b INT, c VARCHAR) PARTITION BY LIST(a);
> > CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES IN (1,2);
> > CREATE TABLE foo_p2 PARTITION OF foo FOR VALUES IN (3,4);
> > INSERT INTO foo select i,i,i from generate_series(1,4)i;
> >
> > CREATE TABLE foo_d (like foo);
> > INSERT INTO foo_d select i,i,i from generate_series(1,9)i;
> >
> > ALTER TABLE foo ATTACH PARTITION foo_d DEFAULT;
> >
> > Above ATTACH PARTITION should fail with "partition constraint is
> violated"
> > error, but instead it's working on a master branch.
>
> Hmm, offhand I don't quite see why this error fails to be thrown.
>
>
ATTACH PARTITION should throw an error, because partitioned table "foo"
already has two partitions with key values (1, 2, 3, 4). And table "foo_d",
which we are attaching here, has:

postgres@76099=#select * from foo_d;
 a | b | c
---+---+---
 1 | 1 | 1
 2 | 2 | 2
 3 | 3 | 3
 4 | 4 | 4
 5 | 5 | 5
 6 | 6 | 6
 7 | 7 | 7
 8 | 8 | 8
 9 | 9 | 9
(9 rows)

After ATTACH PARTITION, when we see the partition constraint for
the newly attached partition:

postgres@76099=#\d+ foo_d
                                      Table "public.foo_d"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition of: foo DEFAULT
Partition constraint: (NOT ((a IS NOT NULL) AND (a = ANY (ARRAY[1, 2, 3, 4]))))

So, it says that this partition (table) should not include the values
(1, 2, 3, 4). But it still has those values.

postgres@76099=#select tableoid::regclass, * from foo;
 tableoid | a | b | c
----------+---+---+---
 foo_p1   | 1 | 1 | 1
 foo_p1   | 2 | 2 | 2
 foo_p2   | 3 | 3 | 3
 foo_p2   | 4 | 4 | 4
 foo_d| 1 | 1 | 1
 foo_d| 2 | 2 | 2
 foo_d| 3 | 3 | 3
 foo_d| 4 | 4 | 4
 foo_d| 5 | 5 | 5
 foo_d| 6 | 6 | 6
 foo_d| 7 | 7 | 7
 foo_d| 8 | 8 | 8
 foo_d| 9 | 9 | 9
(13 rows)

So basically ATTACH PARTITION should throw an error when partition
constraint is violated.


-- 
Rushabh Lathia
www.EnterpriseDB.com


Re: JIT compiling with LLVM v12

2018-03-29 Thread Jesper Pedersen

Hi,

On 03/29/2018 11:03 AM, Pierre Ducroquet wrote:

Clang is needed to emit the LLVM bitcode required for inlining. The "-emit-
llvm" flag is used for that. A dual compilation is required for inlining to
work, one compilation with gcc/clang/msvc/… to build the postgresql binary,
one with clang to generate the .bc files for inlining.
It can be surprising, but there is little way around that (or we accept only
clang to build postgresql, but there would be a riot).



Thanks Pierre.

Best regards,
 Jesper
diff --git a/src/backend/jit/README b/src/backend/jit/README
index bfed319189..dca4aa761c 100644
--- a/src/backend/jit/README
+++ b/src/backend/jit/README
@@ -55,6 +55,9 @@ unlikely to be discontinued, because it has a license compatible with
 PostgreSQL, and because its LLVM IR can be generated from C
 using the clang compiler.
 
+clang will be used to compile files where LLVM bitcode is required to
+inline functions using the -emit-llvm flag. This may cause PostgreSQL
+to be compiled with multiple compilers, such as gcc and clang.
 
 Shared Library Separation
 -


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-03-29 Thread Alvaro Herrera
Rushabh Lathia wrote:
> On Thu, Mar 29, 2018 at 7:46 PM, Alvaro Herrera 
> wrote:

> > Hmm, offhand I don't quite see why this error fails to be thrown.
>
> ATTACH PARTITION should throw an error, because partitioned table "foo"
> already has two partitions with key values (1, 2, 3, 4). And table "foo_d"
> which we are attaching here has:

Oh, I understand how it's supposed to work.  I was just saying I don't
understand how this bug occurs.  Is it because we fail to determine the
correct partition constraint for the default partition in time for its
verification scan?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [POC] Faster processing at Gather node

2018-03-29 Thread Bruce Momjian
On Fri, Mar  2, 2018 at 05:21:28PM -0500, Tels wrote:
> Hello Robert,
> 
> On Fri, March 2, 2018 12:22 pm, Robert Haas wrote:
> > On Wed, Feb 28, 2018 at 10:06 AM, Robert Haas 
> > wrote:
> >> [ latest patches ]
> >
> > Committed.  Thanks for the review.
> 
> Cool :)
> 
> There is a typo, tho:
> 
> + /*
> +  * If the counterpary is known to have attached, we can read mq_receiver
> +  * without acquiring the spinlock and assume it isn't NULL.  Otherwise,
> +  * more caution is needed.
> +  */
> 
> s/counterpary/counterparty/;
> 
> Sorry, only noticed while re-reading the thread.
> 
> Also, either a double space is missing, or one is too many:
> 
> + /*
> +  * Separate prior reads of mq_ring from the increment of mq_bytes_read
> +  * which follows.  Pairs with the full barrier in shm_mq_send_bytes(). 
> We
> +  * only need a read barrier here because the increment of mq_bytes_read 
> is
> +  * actually a read followed by a dependent write.
> +  */
> 
> ("  Pairs ..." vs. ". We only ...")
> 
> Best regards,

Change applied with the attached patch.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
new file mode 100644
index 3faace2..c80cb6e
*** a/src/backend/storage/ipc/shm_mq.c
--- b/src/backend/storage/ipc/shm_mq.c
*** shm_mq_sendv(shm_mq_handle *mqh, shm_mq_
*** 493,499 
  		return SHM_MQ_DETACHED;
  
  	/*
! 	 * If the counterpary is known to have attached, we can read mq_receiver
  	 * without acquiring the spinlock and assume it isn't NULL.  Otherwise,
  	 * more caution is needed.
  	 */
--- 493,499 
  		return SHM_MQ_DETACHED;
  
  	/*
! 	 * If the counterparty is known to have attached, we can read mq_receiver
  	 * without acquiring the spinlock and assume it isn't NULL.  Otherwise,
  	 * more caution is needed.
  	 */
*** shm_mq_inc_bytes_read(shm_mq *mq, Size n
*** 1203,1211 
  
  	/*
  	 * Separate prior reads of mq_ring from the increment of mq_bytes_read
! 	 * which follows.  Pairs with the full barrier in shm_mq_send_bytes(). We
! 	 * only need a read barrier here because the increment of mq_bytes_read is
! 	 * actually a read followed by a dependent write.
  	 */
  	pg_read_barrier();
  
--- 1203,1211 
  
  	/*
  	 * Separate prior reads of mq_ring from the increment of mq_bytes_read
! 	 * which follows.  This pairs with the full barrier in shm_mq_send_bytes().
! 	 * We only need a read barrier here because the increment of mq_bytes_read
! 	 * is actually a read followed by a dependent write.
  	 */
  	pg_read_barrier();
  


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-29 Thread Catalin Iacob
On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro
 wrote:
> I found your discussion with kernel hacker Jeff Layton at
> https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
> writeup seems to want a scheme where pages stay dirty after a
> writeback failure so that we can try to fsync them again. Note that
> that has never been the case in Linux after hard writeback failures,
> AFAIK, so programs should definitely not assume that behavior."

And a bit below in the same comments, to this question about PG: "So,
what are the options at this point? The assumption was that we can
repeat the fsync (which as you point out is not the case), or shut
down the database and perform recovery from WAL", the same Jeff Layton
seems to agree PANIC is the appropriate response:
"Replaying the WAL synchronously sounds like the simplest approach
when you get an error on fsync. These are uncommon occurrences for the
most part, so having to fall back to slow, synchronous error recovery
modes when this occurs is probably what you want to do.".
And right after, he confirms the errseq_t patches are about always
detecting this, not more:
"The main thing I working on is to better guarantee is that you
actually get an error when this occurs rather than silently corrupting
your data. The circumstances where that can occur require some
corner-cases, but I think we need to make sure that it doesn't occur."

Jeff's comments in the pull request that merged errseq_t are worth
reading as well:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

> The article above that says the same thing a couple of different ways,
> ie that writeback failure leaves you with pages that are neither
> written to disk successfully nor marked dirty.
>
> If I'm reading various articles correctly, the situation was even
> worse before his errseq_t stuff landed.  That fixed cases of
> completely unreported writeback failures due to sharing of PG_error
> for both writeback and read errors with certain filesystems, but it
> doesn't address the clean pages problem.

Indeed, that's exactly how I read it as well (opinion formed
independently before reading your sentence above). The errseq_t
patches landed in v4.13 by the way, so very recently.

> Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-29 Thread Simon Riggs
On 29 March 2018 at 01:50, Robert Haas  wrote:
> On Tue, Mar 27, 2018 at 11:53 PM, Pavan Deolasee
>  wrote:
>> Yeah, we should not do that. The patch surely does not intend to replay any
>> more WAL than what we do today. The idea is to just use a different
>> mechanism to find the prior checkpoint. But we should surely find the latest
>> prior checkpoint. In the rare scenario that Tom showed, we should just throw
>> an error and fix the patch if it's not doing that already.
>
> It's not clear to me that there is any reasonable fix short of giving
> up on this approach altogether.  But I might be missing something.

(Sorry to come back to this late)

Yes, a few things have been missed in this discussion.

Michael's point is invalid. The earlier discussion said that *replay*
of WAL earlier than the last checkpoint could lead to corruption, but
this doesn't mean the records themselves are in some way damaged
structurally, only that their content can lead to inconsistency in the
database. The xl_curr proposal does not replay WAL records; it only
searches through them to locate the last checkpoint.

Tom also said "Well, the point of checkpoints is that WAL data before
the last one should no longer matter anymore, isn't it?". WAL Files
prior to the one containing the last checkpoint can be discarded, but
checkpoints themselves do nothing to invalidate the WAL records before
the checkpoint.

Tom's point that a corruption in the WAL file could prevent pg_rewind
from locating a record is valid, but misses something. If the WAL file
contained corruption AFTER the last checkpoint, this would already, in
the current code, prevent pg_rewind from moving backwards to find the
last checkpoint. So whether you go forwards or backwards it is
possible to be blocked by corrupt WAL records.

Most checkpoints are far enough apart that there is only one
checkpoint record in any WAL file and it could be anywhere in the
file, so the expected number of records to be searched by forwards or
backwards scans is equal. And the probability that corruption occurs
before or after the record is also equal.

Given the above points, I don't see any reason to believe that the
forwards search mechanism proposed is faulty, decreases robustness or
is even slower.

Overall, the loss of links between WAL records is not a loss of
robustness. XLOG_SWITCH is a costly operation and mostly unusable
these days, so shouldn't be a reason to shy away from this patch. We
don't jump around in WAL in any other way, so sequential records are
just fine.

So IMHO the proposed approach is still very much viable.

I know the approach is new and surprising but I thought about it a lot
before proposing it and I couldn't see any holes; still can't. Please
give this some thought so we can get comfortable with this idea and
increase performance as a result. Thanks.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: SET TRANSACTION in PL/pgSQL

2018-03-29 Thread Peter Eisentraut
On 3/15/18 17:49, Alexander Korotkov wrote:
> I didn't dig deeply into this subject.  But should we rather teach SPI
> to execute utility statements without taking a snapshot when not
> necessary?  That seems like what the executor does for client-provided
> queries.  And it seems a bit illogical that SPI behaves differently.

Here is the same patch rewritten using SPI, using the new no_snapshots
facility recently introduced.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 02dfce5edb9dd63005e1be81c8bef3d01bff2dd6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 29 Mar 2018 12:00:51 -0400
Subject: [PATCH v2] PL/pgSQL: Add support for SET TRANSACTION

A normal SQL command run inside PL/pgSQL acquires a snapshot, but SET
TRANSACTION does not work anymore if a snapshot is set.  So we have to
handle this separately.
---
 .../plpgsql/src/expected/plpgsql_transaction.out   | 29 ++
 src/pl/plpgsql/src/pl_exec.c   | 35 ++
 src/pl/plpgsql/src/pl_funcs.c  | 23 ++
 src/pl/plpgsql/src/pl_gram.y   | 32 +++-
 src/pl/plpgsql/src/pl_scanner.c|  2 ++
 src/pl/plpgsql/src/plpgsql.h   | 13 +++-
 src/pl/plpgsql/src/sql/plpgsql_transaction.sql | 25 
 7 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/src/pl/plpgsql/src/expected/plpgsql_transaction.out 
b/src/pl/plpgsql/src/expected/plpgsql_transaction.out
index b7f77101c3..2d0e3fa85e 100644
--- a/src/pl/plpgsql/src/expected/plpgsql_transaction.out
+++ b/src/pl/plpgsql/src/expected/plpgsql_transaction.out
@@ -389,6 +389,35 @@ SELECT * FROM test3;
  1
 (1 row)
 
+-- SET TRANSACTION
+DO LANGUAGE plpgsql $$
+BEGIN
+PERFORM 1;
+RAISE INFO '%', current_setting('transaction_isolation');
+COMMIT;
+SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+PERFORM 1;
+RAISE INFO '%', current_setting('transaction_isolation');
+COMMIT;
+SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+RESET TRANSACTION ISOLATION LEVEL;
+PERFORM 1;
+RAISE INFO '%', current_setting('transaction_isolation');
+COMMIT;
+END;
+$$;
+INFO:  read committed
+INFO:  repeatable read
+INFO:  read committed
+-- error case
+DO LANGUAGE plpgsql $$
+BEGIN
+SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+END;
+$$;
+ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
+CONTEXT:  SQL statement "SET TRANSACTION ISOLATION LEVEL REPEATABLE READ"
+PL/pgSQL function inline_code_block line 3 at SET
 DROP TABLE test1;
 DROP TABLE test2;
 DROP TABLE test3;
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index f574aa77f0..91a8cf0950 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -305,6 +305,8 @@ static int exec_stmt_commit(PLpgSQL_execstate *estate,
 PLpgSQL_stmt_commit *stmt);
 static int exec_stmt_rollback(PLpgSQL_execstate *estate,
   PLpgSQL_stmt_rollback *stmt);
+static int exec_stmt_set(PLpgSQL_execstate *estate,
+  PLpgSQL_stmt_set *stmt);
 
 static void plpgsql_estate_setup(PLpgSQL_execstate *estate,
 PLpgSQL_function *func,
@@ -2005,6 +2007,10 @@ exec_stmt(PLpgSQL_execstate *estate, PLpgSQL_stmt *stmt)
rc = exec_stmt_rollback(estate, (PLpgSQL_stmt_rollback 
*) stmt);
break;
 
+   case PLPGSQL_STMT_SET:
+   rc = exec_stmt_set(estate, (PLpgSQL_stmt_set *) stmt);
+   break;
+
default:
estate->err_stmt = save_estmt;
elog(ERROR, "unrecognized cmd_type: %d", 
stmt->cmd_type);
@@ -4707,6 +4713,35 @@ exec_stmt_rollback(PLpgSQL_execstate *estate, 
PLpgSQL_stmt_rollback *stmt)
return PLPGSQL_RC_OK;
 }
 
+/*
+ * exec_stmt_set
+ *
+ * Execute SET/RESET statement.
+ *
+ * We just parse and execute the statement normally, but we have to do it
+ * without setting a snapshot, for things like SET TRANSACTION.
+ */
+static int
+exec_stmt_set(PLpgSQL_execstate *estate, PLpgSQL_stmt_set *stmt)
+{
+   PLpgSQL_expr *expr = stmt->expr;
+   int rc;
+
+   if (expr->plan == NULL)
+   {
+   exec_prepare_plan(estate, expr, 0, true);
+   expr->plan->no_snapshots = true;
+   }
+
+   rc = SPI_execute_plan(expr->plan, NULL, NULL, estate->readonly_func, 0);
+
+   if (rc != SPI_OK_UTILITY)
+   elog(ERROR, "SPI_execute_plan failed executing query \"%s\": 
%s",
+expr->query, SPI_result_code_string(rc));
+
+   return PLPGSQL_RC_OK;
+}
+
 /* --
  * exec_assign_exprPut an expression's result into a 
variable.

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-03-29 Thread Konstantin Knizhnik



On 11.01.2018 22:41, Peter Eisentraut wrote:

On 12/22/17 23:57, Tomas Vondra wrote:

PART 1: adding logical_work_mem memory limit (0001)
---

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here.  This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together.  One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction.  The top-level data structure for that is the infamous
ReorderBuffer.  So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange).  When the transaction commits, the
accumulated changes are passed to the output plugin and then freed.  If
the transaction aborts, then changes are just thrown away.

So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction.  When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.

All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk.  The way this currently works is
hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk.  When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.

This all works mostly fine, but you can construct some more extreme
cases where this can blow up.

Here is a mundane example.  Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example).  If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, triggers, doing batch updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender).  And so on.  If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB.  And if you don't have those 8 GB, then you're stuck at this point.

That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time.  That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.

- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096.  The configuration setting is also in
terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding
session, not per subtransaction.  (So we also keep a running tally for
the entire ReorderBuffer.)
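
To make the three points above concrete, here is a rough sketch of the kind
of size-based accounting being described.  The type names ReorderBuffer,
ReorderBufferTXN and ReorderBufferChange are the real ones; everything else
(the field names, helper functions, and the limit variable) is an
illustrative assumption, not the actual patch:

/*
 * Illustrative sketch only: keep a running byte tally per (sub)transaction
 * and per decoding session, and spill the largest transaction once the
 * session-wide limit is exceeded, instead of spilling each subtransaction
 * at a fixed 4096-change threshold.
 */
static void
account_for_change(ReorderBuffer *rb, ReorderBufferTXN *txn,
                   ReorderBufferChange *change)
{
    Size        sz = change_size(change);   /* actual bytes, not a count */

    txn->total_size += sz;      /* running tally for this (sub)transaction */
    rb->total_size += sz;       /* running tally for the whole session */

    if (rb->total_size > logical_work_mem_bytes)
    {
        ReorderBufferTXN *victim = find_largest_txn(rb);

        spill_txn_to_disk(rb, victim);          /* hypothetical helpers */
        rb->total_size -= victim->total_size;
        victim->total_size = 0;
    }
}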

There are two open issues with this patch:

One, this mechanism only applies when recording changes.  The processing
at commit time still uses the previous hardcoded mechanism.  The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing.  There are some
proposals to change this as well, but they are more involved.  Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.

Two, what to do when the memory limit is reached.  With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit.  Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way.  The proposed patch searches
through the entire list of transactions to find the largest one.  But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-29 Thread Tom Lane
Simon Riggs  writes:
> I know the approach is new and surprising but I thought about it a lot
> before proposing it and I couldn't see any holes; still can't. Please
> give this some thought so we can get comfortable with this idea and
> increase performance as a result. Thanks.

The long and the short of it is that this is a very dangerous-looking
proposal, we are at the tail end of a development cycle, and there are
~100 other patches remaining in the commitfest that also have claims
on our attention in the short time that's left.  If you're expecting
people to spend more time thinking about this now, I feel you're being
unreasonable.

Also, I will say it once more: this change DOES decrease robustness.
It's like blockchain without the chain aspect, or git commits without
a parent pointer.  We are not only interested in whether individual
WAL records are valid, but whether they form a consistent series.
Cross-checking xl_prev provides some measure of confidence about that;
xl_curr offers none.

regards, tom lane



Re: [HACKERS] path toward faster partition pruning

2018-03-29 Thread Tomas Vondra
Hi,

I think there's a bug in generate_pruning_steps_from_opexprs, which does
this for PARTITION_STRATEGY_HASH:


for_each_cell(lc1, lc)
{
pc = lfirst(lc1);

/*
 * Note that we pass nullkeys for step_nullkeys,
 * because we need to tell hash partition bound search
 * function which of the keys are NULL.
 */
Assert(pc->op_strategy == HTEqualStrategyNumber);
pc_steps =
get_steps_using_prefix(context,
   HTEqualStrategyNumber,
   pc->expr,
   pc->cmpfn,
   pc->keyno,
   nullkeys,
   prefix);
}

opsteps = list_concat(opsteps, list_copy(pc_steps));


Notice that the list_concat() is outside the for_each_cell loop. Doesn't
that mean we fail to consider some of the clauses (all except the very
last clause) for pruning? I haven't managed to come up with an example,
but I haven't tried very hard.

FWIW I've noticed this because gcc complains that pc_steps might be used
uninitialized:

partprune.c: In function ‘generate_partition_pruning_steps_internal’:
partprune.c:992:16: warning: ‘pc_steps’ may be used uninitialized in
this function [-Wmaybe-uninitialized]
  opsteps = list_concat(opsteps, list_copy(pc_steps));
^
partprune.c:936:14: note: ‘pc_steps’ was declared here
  List   *pc_steps;
  ^~~~
All of PostgreSQL successfully made. Ready to install.


So even if it's not a bug, we probably need to fix the code somehow.
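
For what it's worth, a minimal sketch of the sort of rearrangement this
suggests, assuming the intent is to keep the steps generated for every
hash-equality clause rather than only the last one (a guess at the intended
behavior, not a committed fix):

for_each_cell(lc1, lc)
{
    pc = lfirst(lc1);

    /*
     * Note that we pass nullkeys for step_nullkeys,
     * because we need to tell hash partition bound search
     * function which of the keys are NULL.
     */
    Assert(pc->op_strategy == HTEqualStrategyNumber);
    pc_steps = get_steps_using_prefix(context,
                                      HTEqualStrategyNumber,
                                      pc->expr,
                                      pc->cmpfn,
                                      pc->keyno,
                                      nullkeys,
                                      prefix);

    /* concatenate inside the loop so no clause's steps are dropped */
    opsteps = list_concat(opsteps, list_copy(pc_steps));
}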


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-29 Thread Teodor Sigaev



Alvaro Herrera wrote:

I don't quite understand the new call in gininsert -- I mean I see that
it wants to check for conflicts even when fastupdate is set, but why?
If fastupdate is set then we check conflicts with the whole index, not particular
pages in it. A predicate lock on pending list pages will effectively be a lock
over the whole index, because every scan will begin from the pending list and each insert will
insert into it. I



Maybe, just maybe, it would be better to add a new flag to the
GinCheckForSerializableConflictIn function, that's passed differently
for this one callsite, and then a comment next to it that indicates why
do we test for fastupdates in one case and not the other case.
If you don't like this idea, then I think more commentary on why
fastupdate is not considered in gininsert is warranted.

In startScanEntry, if you "goto restartScanEntry" in the fastupdate
case, are you trying to acquire the same lock again?  Maybe the lock
acquire should occur before the goto target? (If this doesn't matter for
some reason, maybe add a comment about it)
Thank you for noticing that; I've completely reworked this part. Somehow I
misread the actual behavior of GinGetPendingListCleanupSize() :(.


See attached patch
--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
commit 699e84fab488145d8109f6466f664024855aca38
Author: Teodor Sigaev 
Date:   Thu Mar 29 20:05:35 2018 +0300

Predicate locking in GIN index

Predicate locks are used on a per-page basis only if fastupdate = off;
otherwise a predicate lock on the pending list would effectively lock the whole
index, so to reduce locking overhead we just lock the relation. Entry and
posting trees are essentially B-trees, so locks are acquired on leaf pages only.

Author: Shubham Barai with some editorization by me and Dmitry Ivanov
Review by: Alexander Korotkov, Dmitry Ivanov, Fedor Sigaev
Discussion: https://www.postgresql.org/message-id/flat/calxaept5sww+ewtakugfl5_xfcz0mugbcyj70oqbwqr42yk...@mail.gmail.com

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 37070b3b72..095b1192cb 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/xloginsert.h"
+#include "storage/predicate.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -515,6 +516,19 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			btree->fillRoot(btree, newrootpg,
 			BufferGetBlockNumber(lbuffer), newlpage,
 			BufferGetBlockNumber(rbuffer), newrpage);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(lbuffer));
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
+
 		}
 		else
 		{
@@ -524,6 +538,14 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			GinPageGetOpaque(newrpage)->rightlink = savedRightLink;
 			GinPageGetOpaque(newlpage)->flags |= GIN_INCOMPLETE_SPLIT;
 			GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
 		}
 
 		/*
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index f9daaba52e..642ca1a2c7 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -19,6 +19,7 @@
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/rel.h"
 
 /*
@@ -1759,7 +1760,7 @@ leafRepackItems(disassembledLeaf *leaf, ItemPointer remaining)
  */
 BlockNumber
 createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
-  GinStatsData *buildStats)
+  GinStatsData *buildStats, Buffer entrybuffer)
 {
 	BlockNumber blkno;
 	Buffer		buffer;
@@ -1810,6 +1811,12 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	page = BufferGetPage(buffer);
 	blkno = BufferGetBlockNumber(buffer);
 
+	/*
+	 * Copy a predicate lock from entry tree leaf (containing posting list)
+	 * to  posting tree.
+	 */
+	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
@@ -1904,6 +1911,7 @@ ginInsertItemPointers(Relation index, BlockNumber rootBlkno,
 		btree.itemptr = insertdata.items[insertdata.curitem];
 		stack = ginFindLeafPage(&btree, false, NULL);
 
+		GinCheckForSerializableConflictIn(btree.index, NULL, stack->buffer);
 		ginInsertValue(&btree, stack, &insertdata, buildStats);
 	}
 }
diff --git a/sr

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-29 Thread Tom Lane
Claudio Freire  writes:
> On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane  wrote:
>> After 0001,
>> there's no reason to assume that vacuum is particularly likely to get
>> cancelled between having made cleanups and having updated the upper FSM
>> levels.  (Maybe the odds are a bit more for the no-indexes case, but
>> that doesn't seem like it's common enough to justify a special mechanism
>> either.)

> Why not?

> Any kind of DDL (even those that don't rewrite the heap) would cancel
> autovacuum.

> You might think DDL isn't common enough to worry about, but I've seen
> cases where regular reindex were required to keep index bloat in check
> (and were cron'd), and those cancel autovacuum all the time.

If you've got a situation where every vacuum gets canceled partway
through, you've got bloat problems regardless, because the heap tuples are
never getting removed in the first place; worrying about whether the FSM
is up to date is pretty pointless.  The 0001 patch basically updates the
FSM as soon as it can after the tuples are actually deleted, so I think
we've made the window as small as practical, and I don't really see a need
to do extra work (and add substantial extra complexity) to deal with
missed cleanup a bit sooner.

People who are dealing with this sort of scenario a lot might be well
advised to reduce autovacuum's maintenance_work_mem, so that the cleanup
cycles happen more often.  That's kind of costly in terms of the number
of index scans, but it reduces the amount of cleanup work that can be
lost to a cancel.

(I'd also argue that a setup such as you describe is very possibly making
things worse not better.  Perhaps the 0001 patch will go some way towards
making it less necessary to do that.)

I've pushed 0001 and a replacement for 0003.  I have to go do something
else right now, but I'll take a closer look at 0004 later.

regards, tom lane



Re: Typo in pg_backup_custom.c

2018-03-29 Thread Magnus Hagander
On Thu, Mar 29, 2018 at 3:43 PM, Daniel Gustafsson  wrote:

> Attached fixes a typo, which according to Git is close to being allowed to
> drive.
>

Applied, thanks.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Question about WalSndWriteData

2018-03-29 Thread Konstantin Knizhnik



On 21.03.2018 10:08, Konstantin Knizhnik wrote:



On 21.03.2018 04:50, Peter Eisentraut wrote:

On 3/16/18 12:08, Konstantin Knizhnik wrote:
pq_putmessage_noblock copies data from ctx->out buffer to libpq 
buffers.

After it we write timestamp to ctx->out buffer.
And comments says that we should do it "as late as possible".
But this timestamp is not included in the copy data packet which is
already copied to libpq connection buffer.

There is a pq_flush_if_writable() right after this that will write out
the rest of ctx->out.

Sorry, but pq_flush_if_writable calls socket_flush_if_writable (or
mq_flush_if_writable).

This function flushes the libpq connection buffer, i.e. PqSendBuffer.
This buffer has no relation to the ctx->out buffer, where the timestamp is
written.


The obvious fix is to move the assignment of the timestamp prior to the
pq_putmessage_noblock call:



WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
                bool last_write)
{
    TimestampTz    now;

    /*
     * Fill the send timestamp last, so that it is taken as late as possible.
     * This is somewhat ugly, but the protocol is set as it's already used for
     * several releases by streaming physical replication.
     */
    resetStringInfo(&tmpbuf);
    now = GetCurrentTimestamp();
    pq_sendint64(&tmpbuf, now);
    memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
           tmpbuf.data, sizeof(int64));

    /* output previously gathered data in a CopyData packet */
    pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);





Sorry, I have not received confirmation on whether this is a bug or not,
and whether it is going to be fixed.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Fix src/test/subscription/t/003_constraints.pl header comment

2018-03-29 Thread Magnus Hagander
On Thu, Mar 29, 2018 at 1:32 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> Hi!
>
> src/test/subscription/t/003_constraints.pl starts from the line
>
> # Basic logical replication test
>
> It seems that this line was copy-pasted from src/test/subscription/t/001_
> rep_changes.pl.
> Fix is attached.
>
>
Applied with the addition of another "that" in the text. Thanks!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-29 Thread Tomas Vondra
On 03/29/2018 06:42 PM, Tom Lane wrote:
> Simon Riggs  writes:
>> I know the approach is new and surprising but I thought about it a lot
>> before proposing it and I couldn't see any holes; still can't. Please
>> give this some thought so we can get comfortable with this idea and
>> increase performance as a result. Thanks.
> 
> The long and the short of it is that this is a very dangerous-looking
> proposal, we are at the tail end of a development cycle, and there are
> ~100 other patches remaining in the commitfest that also have claims
> on our attention in the short time that's left.  If you're expecting
> people to spend more time thinking about this now, I feel you're being
> unreasonable.
> 

I agree.

> Also, I will say it once more: this change DOES decrease robustness.
> It's like blockchain without the chain aspect, or git commits without
> a parent pointer.  We are not only interested in whether individual
> WAL records are valid, but whether they form a consistent series.
> Cross-checking xl_prev provides some measure of confidence about that;
> xl_curr offers none.
> 

Not sure.

If each WAL record has xl_curr, then we know to which position the
record belongs (after verifying the checksum). And we do know the size
of each WAL record, so we should be able to deduce if two records are
immediately after each other. Which I think is enough to rebuild the
chain of WAL records.

To defeat this, this would need to happen:

a) the WAL record gets written to a different location
b) the xl_curr gets corrupted in sync with (a)
c) the WAL checksum gets corrupted in sync with (b)
d) the record overwrites existing record (same size/boundaries)

That seems very much like xl_prev.
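
A self-contained sketch of the adjacency check described above, using
stand-in types rather than the real WAL reader code (xl_curr itself is only
the proposal under discussion; the shipped record header carries xl_prev):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for a record header carrying a self-pointer. */
typedef struct SketchRecordHeader
{
    uint32_t    xl_tot_len;     /* total record length, header included */
    uint64_t    xl_curr;        /* byte position the record claims to start at */
    /* payload and CRC would follow in a real layout */
} SketchRecordHeader;

/* stand-in for MAXALIGN: WAL records start on 8-byte boundaries */
static uint64_t
sketch_align(uint64_t len)
{
    return (len + 7) & ~UINT64_C(7);
}

/*
 * After both records' checksums have been verified, adjacency can be
 * deduced from the self-pointer plus the record length: B must begin
 * exactly where A ends.  (Page-header crossings and segment boundaries
 * are ignored here; real code would have to account for them.)
 */
static bool
records_are_adjacent(const SketchRecordHeader *a, uint64_t a_start,
                     const SketchRecordHeader *b, uint64_t b_start)
{
    if (a->xl_curr != a_start || b->xl_curr != b_start)
        return false;           /* a record sitting where it doesn't belong */

    return b_start == a_start + sketch_align(a->xl_tot_len);
}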


regards
-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Feature Request - DDL deployment with logical replication

2018-03-29 Thread Jeremy Finzel
Hello!

I have not seen much discussion about what the plans are for being able to
manage schema changes when using logical replication.  In our own
infrastructure, mechanisms that have been provided to manage DDL statements
at the same transactional point as they happen on the master have been
immensely useful to us, such as replicate_ddl_command from pglogical.

Although we are thrilled with some of the features already in logical
replication, this missing feature is the #1 reason that we don't plan to
take a serious look at built-in logical replication even for pg11, because
we have been able to use pglogical with our own extension pgl_ddl_deploy in
order to broadly deploy logical replication without serious overhauls to
our SDLC process, having schema changes managed well.  We really want a
mechanism to put through DDL changes at the same transactional point on the
subscribers as we do on the publishers, which also answers any complexities
around deploying master-first or slave-first in some interesting cases.

Is there any particular vision for how the community might address this
need in the future?

Thank you!
Jeremy


Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-29 Thread Alvaro Herrera
Teodor Sigaev wrote:
> 
> Alvaro Herrera wrote:
> > I don't quite understand the new call in gininsert -- I mean I see that
> > it wants to check for conflicts even when fastupdate is set, but why?
> If fastupdate is set then we check conflicts with the whole index, not
> particular pages in it. A predicate lock on pending list pages will
> effectively be a lock over the whole index, because every scan will begin from
> the pending list and each insert will insert into it. I

Oh, right, that makes sense.  I'm not sure that the comments explain
this sufficiently -- I think it'd be good to expand that.


Given this patch, it seems clear that serializable mode is much worse
with fastupdate than without it!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Robert Haas
On Thu, Mar 29, 2018 at 9:44 AM, Tom Lane  wrote:
> I'd go a little further and drop "JIT" from user-facing documentation
> altogether.  Instead refer to the feature as "compilation of expressions"
> or some such.  JIT is just jargon.  Plus, the timing of the compilation is
> actually the least important property for our purpose.

I agree that talking about JIT compilation (or just-in-time
compilation) would be better than talking just about JIT, but refusing
to mention JIT seems counter-productive to me.  There are a lot of
people who know what just-in-time compilation is and will not realize
that "compilation of expressions" refers to any such technology.  If
you don't know what it is, you can Google it.  Just typing "jit" into
Google produces a stupid urban dictionary hit and then this:
https://en.wikipedia.org/wiki/Just-in-time_compilation  -- and that
contains useful information that you'll never find if you search for
"compilation of expressions".

Also, in a way, you could argue that v10 already did "compilation of
expressions".  It didn't compile them to machine language, true, but
it translated them into a form which is faster to execute, and which
is at least arguably a form of bytecode.  It's not going to be clear,
even to an expert, that "compilation of expressions" means something
other than that, but if you say JIT, then all of a sudden people know
what we're talking about.

I agree that JIT is jargon, but it's pretty commonly-used jargon.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: pgstat_report_activity() and parallel CREATE INDEX (was: Parallel index creation & pg_stat_activity)

2018-03-29 Thread Phil Florent
Thanks, ran the original test and it works great after the patch.

Phil



De : Robert Haas 
Envoyé : jeudi 22 mars 2018 18:17
À : Peter Geoghegan
Cc : Andres Freund; Phil Florent; PostgreSQL Hackers
Objet : Re: pgstat_report_activity() and parallel CREATE INDEX (was: Parallel 
index creation & pg_stat_activity)

On Wed, Mar 7, 2018 at 8:53 PM, Peter Geoghegan  wrote:
> On Thu, Mar 1, 2018 at 2:47 PM, Peter Geoghegan  wrote:
>> No. Just an oversight. Looks like _bt_parallel_build_main() should
>> call pgstat_report_activity(), just like ParallelQueryMain().
>>
>> I'll come up with a patch soon.
>
> Attached patch has parallel CREATE INDEX propagate debug_query_string
> to workers. Workers go on to use this string as their own
> debug_query_string, as well as registering it using
> pgstat_report_activity(). Parallel CREATE INDEX pg_stat_activity
> entries will have a query text, too, which addresses Phil's complaint.

Committed.  Thanks for the patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] A design for amcheck heapam verification

2018-03-29 Thread Peter Geoghegan
On Thu, Mar 29, 2018 at 2:28 AM, Simon Riggs  wrote:
> I understand we are adding a check to verify heap against index and
> this will take longer than before. When it runs does it report
> progress of the run via pg_stat_activity, so we can monitor how long
> it will take?

My point to Pavan was that the checker function that uses a weaker
lock, bt_index_check(), imitates a CREATE INDEX CONCURRENTLY. However,
unlike CIC, it acquires an ASL, not a ShareUpdateExclusiveLock. I
believe that this is safe, because we only actually perform a process
that's similar to the first heap scan of a CIC, without the other CIC
steps. (The variant that uses a ShareLock, bt_index_parent_check(),
imitates a CREATE INDEX, so no difference in lock strength there.)

amcheck is still just a couple of SQL-callable functions, so the
closest thing that there is to progress monitoring is DEBUG traces,
which I've found to be informative in real-world situations. Of
course, only a well informed user can interpret that. No change there.

> Locking is also an important concern.
>
> If we need a ShareLock to run the extended check and the check runs
> for a long time, when would we decide to run that? This sounds like it
> will cause a production outage, so what are the pre-conditions that
> would lead us to say "we'd better run this". For example, if running
> this is known to be signficantly faster than running CREATE INDEX,
> that might be an argument for someone to run this first if index
> corruption is suspected.

I think that most people will choose to use bt_index_check(), since,
as I said, it only requires an ASL, as opposed to
bt_index_parent_check(), which requires a ShareLock. This is
especially likely because there isn't much difference between how
thorough verification is. The fact that bt_index_check() is likely the
best thing to use for routine verification is documented in the
v10/master version [1], which hasn't changed. That said, there are
definitely environments where nobody cares a jot that a ShareLock is
needed, which is why we have bt_index_parent_check(). Often, amcheck
is used when the damage is already done, so that it can be fully
assessed.

This patch does not really change the interface; it just adds a new
heapallindexed argument, which has a default of 'false'.
bt_index_check() is unlikely to cause too many problems in production.
Heroku ran it on all databases at one point, and it wasn't too much
trouble. Of course, that was a version that lacked this heapallindexed
enhancement, which slows down verification by rather a lot when
actually used. My rough estimate is that heapallindexed verification
makes the process take 5x longer.

> If it detects an issue, can it fix the issue for the index by
> injecting correct entries? If not then we will have to run CREATE
> INDEX afterwards anyway, which makes it more likely that people would
> just run CREATE INDEX and not bother with the check.

It does not fix anything. I think that the new check is far more
likely to find problems in the heap than in the index, which is the
main reason for this.

The new check can only begin to run at the point where the index
structure has been found to be self-consistent, which is the main
reason why it seems like more of a heap checker than an index checker.
Also, that's what it actually found in the couple of interesting cases
that we've seen. It detects "freeze the dead" corruption, at least
with the test cases we have available. It also detects corruption
caused by failing to detect broken HOT chains during an initial CREATE
INDEX; there were two such bugs in CREATE INDEX CONCURRENTLY, one in
2012, and another in 2017, as you'll recall. The 2017 one (the CIC bug
that Pavan found through mechanical testing) went undetected for a
very long time; I think that a tool like this greatly increases our
odds of early detection of that kind of thing.

These are both issues that kind of seem like index corruption, that
are actually better understood as heap corruption. It's subtle.

> So my initial questions are about when we would run this and making
> sure that is documented.

Good question. I find it hard to offer general advice about this, to
be honest. In theory, you should never need to run it, because
everything should work, and in practice that's generally almost true.
I've certainly used it when investigating problems after the fact, and
as a general smoke-test, where it works well. I would perhaps
recommend running it once a week, although that's a fairly arbitrary
choice. The docs in v10 don't take a position on this, so while I tend
to agree we could do better, it is a preexisting issue.

[1] https://www.postgresql.org/docs/10/static/amcheck.html#id-1.11.7.11.7
-- 
Peter Geoghegan



Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Andres Freund
On 2018-03-29 13:26:31 -0400, Robert Haas wrote:
> On Thu, Mar 29, 2018 at 9:44 AM, Tom Lane  wrote:
> > I'd go a little further and drop "JIT" from user-facing documentation
> > altogether.  Instead refer to the feature as "compilation of expressions"
> > or some such.  JIT is just jargon.  Plus, the timing of the compilation is
> > actually the least important property for our purpose.
> 
> I agree that talking about JIT compilation (or just-in-time
> compilation) would be better than talking just about JIT, but refusing
> to mention JIT seems counter-productive to me.  There are a lot of
> people who know what just-in-time compilation is and will not realize
> that "compilation of expressions" refers to any such technology.  If
> you don't know what it is, you can Google it.  Just typing "jit" into
> Google produces a stupid urban dictionary hit and then this:
> https://en.wikipedia.org/wiki/Just-in-time_compilation  -- and that
> contains useful information that you'll never find if you search for
> "compilation of expressions".
> 
> Also, in a way, you could argue that v10 already did "compilation of
> expressions".  It didn't compile them to machine language, true, but
> it translated them into a form which is faster to execute, and which
> is at least arguably a form of bytecode.  It's not going to be clear,
> even to an expert, that "compilation of expressions" means something
> other than that, but if you say JIT, then all of a sudden people know
> what we're talking about.
> 
> I agree that JIT is jargon, but it's pretty commonly-used jargon.

Precisely this. I'm very strongly against just saying "expression
compilation", it's just too imprecise. As Robert mentions it could refer
to what we do in v10, it could refer to ahead of time compilation of PL
functions, and it doesn't include compiling tuple deforming.  Nor will
it describe compiling sorting, copy or whatnot.

I'm very open however to replacing JITing with JIT compiling and similar
substitutions.

Greetings,

Andres Freund



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-29 Thread Robert Haas
On Thu, Mar 29, 2018 at 1:13 PM, Tomas Vondra
 wrote:
> On 03/29/2018 06:42 PM, Tom Lane wrote:
>> The long and the short of it is that this is a very dangerous-looking
>> proposal, we are at the tail end of a development cycle, and there are
>> ~100 other patches remaining in the commitfest that also have claims
>> on our attention in the short time that's left.  If you're expecting
>> people to spend more time thinking about this now, I feel you're being
>> unreasonable.
>
> I agree.

+1.

>> Also, I will say it once more: this change DOES decrease robustness.
>> It's like blockchain without the chain aspect, or git commits without
>> a parent pointer.  We are not only interested in whether individual
>> WAL records are valid, but whether they form a consistent series.
>> Cross-checking xl_prev provides some measure of confidence about that;
>> xl_curr offers none.
>
> Not sure.
>
> If each WAL record has xl_curr, then we know to which position the
> record belongs (after verifying the checksum). And we do know the size
> of each WAL record, so we should be able to deduce if two records are
> immediately after each other. Which I think is enough to rebuild the
> chain of WAL records.
>
> To defeat this, this would need to happen:
>
> a) the WAL record gets written to a different location
> b) the xl_curr gets corrupted in sync with (a)
> c) the WAL checksum gets corrupted in sync with (b)
> d) the record overwrites existing record (same size/boundaries)
>
> That seems very much like xl_prev.

I don't think so, because this ignores, for example, timeline
switches, or multiple clusters accidentally sharing an archive
directory.

TBH, I'm not entirely sure how likely corruption would be under this
proposal in practice, but I think Tom's got the right idea on a
theoretical level.   As he says, there's a reason why things like
block chains and git commits include something about the previous
record in what gets hashed to create the identifier for the new
record.  It ties those things together in a way that doesn't happen
otherwise.  If somebody proposed changing git so that the SHA
identifier for a commit didn't hash the commit ID for the previous
commit, that would break the security of the system in all kinds of
ways: for example, an adversary could maliciously edit an old commit
by just changing its SHA identifier and that of its immediate
descendent.  No other commit IDs would change and integrity checks
would not fail -- there would be no easy way to notice that something
bad had happened.

Now, in our case, the chances of problems are clearly a lot more
remote because of the way that WAL records are packed into files.
Every WAL record has a place where it's supposed to appear in the WAL
stream, and positions in the WAL stream are never intentionally
recycled (except in PITR scenarios, I guess).  Because of that, the
proposed xl_curr check provides a pretty strong cross-check on the
validity of a record even without xl_prev.  I think there is an
argument to be made - as Simon is doing - that this check is in fact
strong enough that it's good enough for government work.  He might be
right.  But he might also be wrong, because I think Tom is
unquestionably correct when he says that this is weakening the check.
What I'm not sure about is whether it's being weakened enough that
it's actually going to cause problems in practice.  Given where we are
in the release cycle, I'd prefer to defer that question to next cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal: http2 wire format

2018-03-29 Thread Andres Freund
On 2018-03-28 20:34:13 -0400, Peter Eisentraut wrote:
> On 3/28/18 12:09, Andres Freund wrote:
> > Yea, not the most descriptive... Returning multiple different resultsets
> > from a function / procedure. Inability to do so is a serious limitation
> > of postgres in comparison to some other language with procedures.
> 
> This is already possible as far as the protocol is concerned.

Huh, I don't see how?

Greetings,

Andres Freund



Re: ALTER TABLE ADD COLUMN fast default

2018-03-29 Thread Andres Freund
Hi,

On 2018-03-29 10:16:23 +1030, Andrew Dunstan wrote:
> On Wed, Mar 28, 2018 at 5:30 PM, Andres Freund  wrote:
> > Hi,
> >
> > On 2018-03-26 09:02:10 +1030, Andrew Dunstan wrote:
> >> Thanks for this, all looks good. Here is the consolidated patch
> >> rebased. If there are no further comments I propose to commit this in
> >> a few days time.
> >
> > Here's a bit of post commit review:
> >
> > @@ -1310,13 +1679,11 @@ slot_getsomeattrs(TupleTableSlot *slot, int attnum)
> >
> > /*
> >  * If tuple doesn't have all the atts indicated by tupleDesc, read 
> > the
> > -* rest as null
> > +* rest as NULLs or missing values
> >  */
> > -   for (; attno < attnum; attno++)
> > -   {
> > -   slot->tts_values[attno] = (Datum) 0;
> > -   slot->tts_isnull[attno] = true;
> > -   }
> > +   if (attno < attnum)
> > +   slot_getmissingattrs(slot, attno, attnum);
> > +
> > slot->tts_nvalid = attnum;
> >  }
> >
> > It's worthwhile to note that this'll re-process *all* missing values,
> > even if they've already been deformed. I.e. if
> > slot_getmissingattrs(.., first-missing)
> > slot_getmissingattrs(.., first-missing + 1)
> > slot_getmissingattrs(.., first-missing + 2)
> > is called, all three missing values will be copied every time. That's
> > because tts_nvalid isn't taken into account.  I wonder if 
> > slot_getmissingattrs
> > could take tts_nvalid into account?
> >
> > I also wonder if this doesn't deserve an unlikely(), to avoid the cost
> > of spilling registers in the hot branch of not missing any values.

> One of us at least is very confused about this function.
> slot_getmissingattrs() gets called at most once by slot_getsomeattrs
> and never for any attribute that's already been deformed.

Imagine the same slot being used in two different expressions. The
typical case for that is e.g. something like
  SELECT a,b,c,d FROM foo WHERE b = 1;

this'll trigger one slot_getsomeattrs(slot, 2) call from within qual
evaluation, and then a slot_getsomeattrs(slot, 4) from within the
projection code.  If you then imagine a tuple where only the first
column exists physically, we'd copy b twice, because attno is only going
to be 1.  I think we might just want to check tts_nvalid in
slot_getmissingattrs?
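
A rough sketch of the check being floated here.  The three-argument
slot_getmissingattrs(slot, startAttNum, lastAttNum) call matches the hunk
quoted above; the wrapper and its name are purely illustrative:

/*
 * Illustrative only: make the missing-attribute fill-in respect what a
 * previous slot_getsomeattrs() call has already deformed, by consulting
 * tts_nvalid before copying anything.
 */
static void
slot_getmissingattrs_checked(TupleTableSlot *slot, int startAttNum, int lastAttNum)
{
    /* a previous call may already have filled in some missing values */
    if (startAttNum < slot->tts_nvalid)
        startAttNum = slot->tts_nvalid;

    if (startAttNum >= lastAttNum)
        return;                 /* everything requested is already valid */

    slot_getmissingattrs(slot, startAttNum, lastAttNum);
}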


> > I still strongly object to doing this unconditionally for queries
> > that never need any of this.  We can end up with a significant number
> > of large tuples in memory here, and then copy that through dozens of
> > tupledesc copies in queries.

> We're just doing the same thing we do for default values.

That's really not a defense. It's hurting us there too.


> None of the tests I did with large numbers of missing values seemed to
> show performance impacts of the kind you describe. Now, none of the
> queries were particularly complex, but the worst case was from
> actually using very large numbers of attributes with missing values,
> where we'd need this data anyway. If we just selected a few attributes
> performance went up rather than down.

Now imagine a partitioned workload with a few thousand partitions over a
few tables. The additional memory is going to start being noticeable.


> pg_attrdef isn't really a good place for it (what if they drop the
> default?). So the only alternative then would be a completely new
> catalog. I'd need a good deal of convincing that that was justified.

There are plenty of databases with pg_attribute being many gigabytes large,
and this is going to make that even worse.

Greetings,

Andres Freund



Re: Typo in src/backend/access/hash/README

2018-03-29 Thread Bruce Momjian
On Mon, Mar  5, 2018 at 12:23:31PM +0530, Amit Kapila wrote:
> On Mon, Mar 5, 2018 at 9:10 AM, Justin Pryzby  wrote:
> > On Mon, Mar 05, 2018 at 04:30:28PM +1300, Thomas Munro wrote:
> >> Hi,
> >>
> >> Not sure what word was missed here, but I guess "count":
> >>
> >>  our the number of buckets stored in our cached copy of the metapage.  If
> >> -so, the bucket has certainly been split, because the must originally
> >> +so, the bucket has certainly been split, because the count must originally
> >>  have been less than the number of buckets that existed at that time and
> >
> > I think there's a 2nd typo:
> >
> > |After computing the ostensibly-correct bucket number based on our cached
> > |copy of the metapage, we lock the corresponding primary bucket page and
> > |check whether the bucket count stored in hasho_prevblkno is greater than
> > |OUR the number of buckets stored in our cached copy of the metapage.  If
> >
> > remove "our"?
> >
> 
> Both your and Thomas's proposed change looks good to me.

Fixed in the attached, applied patch.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
new file mode 100644
index bb90722..21b4a82
*** a/src/backend/access/hash/README
--- b/src/backend/access/hash/README
*** reality, InvalidBlockNumber.
*** 189,196 
  After computing the ostensibly-correct bucket number based on our cached
  copy of the metapage, we lock the corresponding primary bucket page and
  check whether the bucket count stored in hasho_prevblkno is greater than
! our the number of buckets stored in our cached copy of the metapage.  If
! so, the bucket has certainly been split, because the must originally
  have been less than the number of buckets that existed at that time and
  can't have increased except due to a split.  If not, the bucket can't have
  been split, because a split would have created a new bucket with a higher
--- 189,196 
  After computing the ostensibly-correct bucket number based on our cached
  copy of the metapage, we lock the corresponding primary bucket page and
  check whether the bucket count stored in hasho_prevblkno is greater than
! the number of buckets stored in our cached copy of the metapage.  If
! so, the bucket has certainly been split, because the count must originally
  have been less than the number of buckets that existed at that time and
  can't have increased except due to a split.  If not, the bucket can't have
  been split, because a split would have created a new bucket with a higher


Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2018-03-29 13:26:31 -0400, Robert Haas wrote:
> > On Thu, Mar 29, 2018 at 9:44 AM, Tom Lane  wrote:
> > > I'd go a little further and drop "JIT" from user-facing documentation
> > > altogether.  Instead refer to the feature as "compilation of expressions"
> > > or some such.  JIT is just jargon.  Plus, the timing of the compilation is
> > > actually the least important property for our purpose.
> > 
> > I agree that talking about JIT compilation (or just-in-time
> > compilation) would be better than talking just about JIT, but refusing
> > to mention JIT seems counter-productive to me.  There are a lot of
> > people who know what just-in-time compilation is and will not realize
> > that "compilation of expressions" refers to any such technology.  If
> > you don't know what it is, you can Google it.  Just typing "jit" into
> > Google produces a stupid urban dictionary hit and then this:
> > https://en.wikipedia.org/wiki/Just-in-time_compilation  -- and that
> > contains useful information that you'll never find if you search for
> > "compilation of expressions".
> > 
> > Also, in a way, you could argue that v10 already did "compilation of
> > expressions".  It didn't compile them to machine language, true, but
> > it translated them into a form which is faster to execute, and which
> > is at least arguably a form of bytecode.  It's not going to be clear,
> > even to an expert, that "compilation of expressions" means something
> > other than that, but if you say JIT, then all of a sudden people know
> > what we're talking about.
> > 
> > I agree that JIT is jargon, but it's pretty commonly-used jargon.
> 
> Precisely this. I'm very strongly against just saying "expression
> compilation", it's just too imprecise. As Robert mentions it could refer
> to what we do in v10, it could refer to ahead of time compilation of PL
> functions, and it doesn't include compiling tuple deforming.  Nor will
> it describe compiling sorting, copy or whatnot.
> 
> I'm very open however to replacing JITing with JIT compiling and similar
> substitutions.

What we've done elsewhere when there's been similar jargon is to say
something along the lines of:

"compiling of routines (also known as Just-In-Time or JIT compilation)"

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Incorrect use of "an" and "a" in code comments and docs

2018-03-29 Thread Bruce Momjian
On Mon, Mar  5, 2018 at 07:23:22PM +1300, David Rowley wrote:
> On 5 March 2018 at 18:42, Michael Paquier  wrote:
> > On Mon, Mar 05, 2018 at 01:58:54PM +0900, Michael Paquier wrote:
> >> Attached is a proposal of patch to fix all those things.
> >
> > And of course I forgot to attach the actual patch.  (Thanks Thomas for
> > telling me off-list).
> 
> If you're doing a round of that, then you may as well throw [1] into
> the mix. It's gone a bit dead over there, and it does seem like it
> would be good to not have two separate commits for these.
> 
> [1] 
> https://www.postgresql.org/message-id/CAKJS1f81A20mO_A___3sDnTZcZEH-2Pd5oTeiAUb=qo2r9m...@mail.gmail.com

This has been applied already:

commit 3beb46ae8118f64d94518ba105b5e5c79e2ce194
Author: Alvaro Herrera 
Date:   Mon Mar 12 12:58:35 2018 -0300

docs: Fix typo: a -> an

David Rowley

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [HACKERS] [PATCH] Incremental sort

2018-03-29 Thread Alexander Kuzmenkov

Hi Alexander,

I took a quick look at the patch. Some things I fixed myself in the 
attached patch v.21. Here is the summary:


Typo in compare_fractional_path_costs() should be fixed as a separate patch.
Remove unused function estimate_pathkeys_groups.
An extra MemoryContextReset before tuplesort_end() shouldn't be a big deal, 
so we don't have to add a parameter to tuplesort_free().

Add comment to maincontext declaration.
Fix typo in INITIAL_MEMTUPSIZE.
Remove trailing whitespace.

Some other things I found:

In tuplesort_reset:
if (state->memtupsize < INITIAL_MEMTUPSIZE)
    
I'd add a comment explaining when and why we have to do this. Maybe also 
add a comment to the other allocations of memtuples in tuplesort_begin() and 
mergeruns(), explaining why the array is reallocated and why it lives in
maincontext.



In tuplesort_updatemax:
    /* XXX */
    if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
        (spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > 
state->maxSpace))

The XXX placeholder needs a real comment here. Also, comparing bools with '>' looks confusing to me.
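
For what it's worth, a possibly clearer way to write that test, assuming
both *OnDisk fields are booleans as the comment above implies ('a > b' on
bools is just 'a && !b'); this is only an illustration, not text from the
patch:

    if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
        (spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
    {
        /* ... record the new maximum ... */
    }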


We should add a comment on top of tuplesort.c, explaining that we now 
have a faster way to sort multiple batches of data using the same sort 
conditions.
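
As a rough sketch of the usage pattern such a comment might describe --
only an illustration, with tuplesort_begin_heap() arguments elided, and
nbatches / fetch_next_tuple_in_batch() being hypothetical caller-side names,
not code from the patch:

    Tuplesortstate *state;
    TupleTableSlot *slot;
    int batch;

    /* build the sort state once, with the sort keys shared by all batches */
    state = tuplesort_begin_heap(/* arguments elided */);

    for (batch = 0; batch < nbatches; batch++)
    {
        /* feed the current batch of tuples */
        while ((slot = fetch_next_tuple_in_batch()) != NULL)
            tuplesort_puttupleslot(state, slot);

        tuplesort_performsort(state);

        /* ... read the sorted batch back with tuplesort_gettupleslot() ... */

        /*
         * Reuse the sort state (and its memtuples array) for the next
         * batch instead of calling tuplesort_end() and starting over.
         */
        tuplesort_reset(state);
    }

    tuplesort_end(state);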



The name 'main context' sounds somewhat vague. Maybe 'top context'? Not 
sure.



In ExecSupportBackwardsScan:
        case T_IncrementalSort:
            return false;
This separate case looks useless; I'd either add a comment explaining 
why it can't scan backwards, or just return false by default.




That's all I have for today; tomorrow I'll continue with reviewing the 
planner part of the patch.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 2d6e387d63..d11777cb90 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, essential optimization is top-N
+-- sort.  But it can't be processed at remote side, because we never do LIMIT
+-- push down.  Assuming that sorting is not worth it to push down, CROSS JOIN
+-- is also not pushed down in order to transfer less tuples over network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
- QUERY PLAN  
--
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+QUERY PLAN
+--
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
->  Sort
- Output: t1.c1, t2.c1
- Sort Key: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
+ Sort Key: t1.c3, t2.c3
  ->  Nested Loop
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
->  Foreign Scan on public.ft1 t1
- Output: t1.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+ Output: t1.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
->  Materialize
- Output: t2.c1
+ Output: t2.c3
  ->  Foreign Scan on public.ft2 t2
-   Output: t2.c1
-   Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+   Output: t2.c3
+   Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+---+---
+ 1 | 00101
+ 1 | 00102
+ 1 | 00103
+ 1 | 00104
+ 1 | 00105
+ 1 | 00106
+ 1 | 00107
+ 1 | 00108
+ 1 | 00109
+ 1 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike previous query, remote side is able to
+-- return tuples in given order without full sort, but using index scan and
+-- incremental sort.  This is much cheaper than full sort on local side, even
+-- despite we don't know LIMIT on remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+QUERY PLAN 
+---
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+ Output: t1.c1, t2.c1
+ Relatio

Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Robert Haas
On Thu, Mar 29, 2018 at 3:00 PM, Stephen Frost  wrote:
> What we've done elsewhere when there's been similar jargon is to say
> something along the lines of:
>
> "compiling of routines (also known as Just-In-Time or JIT compilation)"

That strike me as roughly analogous to saying:

"hiding rows users don't have permission to see (also known as
row-level security)"

IOW, it's just substituting a generic description for a term of art
that has a real meaning.  I can't see that as an improvement.  It
would be possible to compile routines without involving JIT; as I
mentioned earlier, you could compile them to bytecode. Or, you could
compile them to machine code at definition time as, I am told by
Thomas, some database products used to do, I think by generating C
code and then shelling out to "cc".  Just-in-time compilation clearly
conveys that we are generating machine code at runtime; "compiling of
routines" does not.

Besides, if we did that, what are we going to call the GUCs?
compiling_of_routines_above_cost?  compiling_of_routines_provider =
'llvmjit'?  Ugh.

This technique has been called JIT since about 1983, and is well-known
under that name, mostly because of Java.  Calling it some made-up name
we create ourselves is not going to make this more comprehensible to
users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Add documentation for the JIT feature.

2018-03-29 Thread Andres Freund
On 2018-03-29 15:00:36 -0400, Stephen Frost wrote:
> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2018-03-29 13:26:31 -0400, Robert Haas wrote:
> > > On Thu, Mar 29, 2018 at 9:44 AM, Tom Lane  wrote:
> > > > I'd go a little further and drop "JIT" from user-facing documentation
> > > > altogether.  Instead refer to the feature as "compilation of 
> > > > expressions"
> > > > or some such.  JIT is just jargon.  Plus, the timing of the compilation 
> > > > is
> > > > actually the least important property for our purpose.
> > >
> > > I agree that talking about JIT compilation (or just-in-time
> > > compilation) would be better than talking just about JIT, but refusing
> > > to mention JIT seems counter-productive to me.  There are a lot of
> > > people who know what just-in-time compilation is and will not realize
> > > that "compilation of expressions" refers to any such technology.  If
> > > you don't know what it is, you can Google it.  Just typing "jit" into
> > > Google produces a stupid urban dictionary hit and then this:
> > > https://en.wikipedia.org/wiki/Just-in-time_compilation  -- and that
> > > contains useful information that you'll never find if you search for
> > > "compilation of expressions".
> > >
> > > Also, in a way, you could argue that v10 already did "compilation of
> > > expressions".  It didn't compile them to machine language, true, but
> > > it translated them into a form which is faster to execute, and which
> > > is at least arguably a form of bytecode.  It's not going to be clear,
> > > even to an expert, that "compilation of expressions" means something
> > > other than that, but if you say JIT, then all of a sudden people know
> > > what we're talking about.
> > >
> > > I agree that JIT is jargon, but it's pretty commonly-used jargon.
> >
> > Precisely this. I'm very strongly against just saying "expression
> > compilation", it's just too imprecise. As Robert mentions it could refer
> > to what we do in v10, it could refer to ahead of time compilation of PL
> > functions, and it doesn't include compiling tuple deforming.  Nor will
> > it describe compiling sorting, copy or whatnot.
> >
> > I'm very open however to replacing JITing with JIT compiling and similar
> > substitutions.
>
> What we've done elsewhere when there's been similar jargon is to say
> something along the lines of:
>
> "compiling of routines (also known as Just-In-Time or JIT compilation)"

There's an entire section explaining what JIT is and what we currently
can JIT.  We can't just say "compiling of routines (also ...)", because
a) we don't do that. b) even if you take expressions being JIT compiled
as that, we also JIT compile tuple deforming. And we will hopefully JIT
further things in the not-too-distant future.

The general config.sgml references and such read
Sets the planner's cutoff above which JIT compilation is used as part
of query execution (see ). Performing
JIT costs time but can accelerate query execution.
it's not used in a standalone manner without references.

and the jit specific section starts with:
  
   Just-in-time compilation (JIT) is the process of turning
   some form of interpreted program evaluation into a native program, and
   doing so at runtime.

   For example, instead of using a facility that can evaluate arbitrary SQL
   expressions to evaluate an SQL predicate like WHERE a.col =
   3, it is possible to generate a function that can be natively
   executed by the CPU that just handles that expression, yielding a speedup.
  

and continues a bit later with
  
   JIT Accelerated Operations
   
Currently PostgreSQL's JIT
implementation has support for accelerating expression evaluation and
tuple deforming.  Several other operations could be accelerated in the
future.
   
   
Expression evaluation is used to evaluate WHERE
clauses, target lists, aggregates and projections. It can be accelerated
by generating code specific to each case.
   
   
Tuple deforming is the process of transforming an on-disk tuple (see ) into its in-memory representation. It can be
accelerated by creating a function specific to the table layout and the
number of columns to be extracted.
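
To make the "WHERE a.col = 3" example quoted above concrete, here is a
hand-written illustration of the difference between a generic interpreted
evaluator and a specialized function.  This is not PostgreSQL code and not
what the LLVM-based JIT actually emits; the struct and function names are
made up for the example:

    #include <stdbool.h>

    typedef struct ExprNode
    {
        int     op;         /* which operator this node represents */
        int     attno;      /* which column to read */
        int     constval;   /* constant to compare against */
    } ExprNode;

    /* Generic interpreter: dispatches on the operator for every tuple. */
    static bool
    eval_expr_interpreted(const ExprNode *expr, const int *tuple_values)
    {
        switch (expr->op)
        {
            case 0:         /* column = constant */
                return tuple_values[expr->attno] == expr->constval;
            /* ... many other operators, each adding dispatch overhead ... */
            default:
                return false;
        }
    }

    /*
     * What JIT compilation conceptually produces for "a.col = 3": the
     * operator dispatch and tree walking are folded away at run time,
     * leaving straight-line code the CPU executes directly.
     */
    static bool
    eval_col_eq_3_specialized(const int *tuple_values)
    {
        return tuple_values[0] == 3;    /* assumes a.col is the first column */
    }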
   
  

I think there are a few references to standalone JIT that deserve to be
replaced with JIT compilation and similar.

Greetings,

Andres Freund



Re: Incorrect use of "an" and "a" in code comments and docs

2018-03-29 Thread Bruce Momjian
On Mon, Mar  5, 2018 at 01:58:54PM +0900, Michael Paquier wrote:
> Hi all,
> 
> While looking at something else, I have found one occurrence of $subject.
> Looking more deeply at the code I have found 10 more of them, like:
> - sinval is a signal invalidation, so it seems to me that "a" is
> correct, not "an".
> - I bumped into "an" being used instead of "and" in the GIN code.
> - config/c-library.m4 also includes one.  If that code is from somewhere
> else, fixing this typo would not be appropriate perhaps?
> 
> Attached is a proposal of patch to fix all those things.

Attached patch applied.  It includes a suggested fix by Abhijit
Menon-Sen.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/config/c-library.m4 b/config/c-library.m4
new file mode 100644
index 9c2207b..34b2508
*** a/config/c-library.m4
--- b/config/c-library.m4
*** AC_DEFUN([PGAC_STRUCT_ADDRINFO],
*** 177,183 
  # handle ll, q, and I64.  The result is in shell variable
  # LONG_LONG_INT_MODIFIER.
  #
! # MinGW uses '%I64d', though gcc throws an warning with -Wall,
  # while '%lld' doesn't generate a warning, but doesn't work.
  #
  AC_DEFUN([PGAC_FUNC_SNPRINTF_LONG_LONG_INT_MODIFIER],
--- 177,183 
  # handle ll, q, and I64.  The result is in shell variable
  # LONG_LONG_INT_MODIFIER.
  #
! # MinGW uses '%I64d', though gcc throws a warning with -Wall,
  # while '%lld' doesn't generate a warning, but doesn't work.
  #
  AC_DEFUN([PGAC_FUNC_SNPRINTF_LONG_LONG_INT_MODIFIER],
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
new file mode 100644
index 398532d..630d6a7
*** a/src/backend/access/gin/ginvacuum.c
--- b/src/backend/access/gin/ginvacuum.c
*** ginVacuumPostingTreeLeaves(GinVacuumStat
*** 381,387 
  
  		/*
  		 * All subtree is empty - just return true to indicate that parent
! 		 * must do a cleanup. Unless we are ROOT an there is way to go upper.
  		 */
  
  		if (hasEmptyChild && !hasNonEmptyChild && !isRoot)
--- 381,387 
  
  		/*
  		 * All subtree is empty - just return true to indicate that parent
! 		 * must do a cleanup, unless we are ROOT and there is way to go upper.
  		 */
  
  		if (hasEmptyChild && !hasNonEmptyChild && !isRoot)
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
new file mode 100644
index 8300057..0ace196
*** a/src/backend/catalog/aclchk.c
--- b/src/backend/catalog/aclchk.c
*** pg_extension_ownercheck(Oid ext_oid, Oid
*** 5280,5286 
  }
  
  /*
!  * Ownership check for an publication (specified by OID).
   */
  bool
  pg_publication_ownercheck(Oid pub_oid, Oid roleid)
--- 5280,5286 
  }
  
  /*
!  * Ownership check for a publication (specified by OID).
   */
  bool
  pg_publication_ownercheck(Oid pub_oid, Oid roleid)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
new file mode 100644
index 639b699..0f844c0
*** a/src/backend/commands/cluster.c
--- b/src/backend/commands/cluster.c
*** finish_heap_swap(Oid OIDOldHeap, Oid OID
*** 1539,1545 
  		frozenXid, cutoffMulti, mapped_tables);
  
  	/*
! 	 * If it's a system catalog, queue an sinval message to flush all
  	 * catcaches on the catalog when we reach CommandCounterIncrement.
  	 */
  	if (is_system_catalog)
--- 1539,1545 
  		frozenXid, cutoffMulti, mapped_tables);
  
  	/*
! 	 * If it's a system catalog, queue a sinval message to flush all
  	 * catcaches on the catalog when we reach CommandCounterIncrement.
  	 */
  	if (is_system_catalog)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
new file mode 100644
index a2d9381..86252ce
*** a/src/backend/executor/instrument.c
--- b/src/backend/executor/instrument.c
*** InstrAlloc(int n, int instrument_options
*** 49,55 
  	return instr;
  }
  
! /* Initialize an pre-allocated instrumentation structure. */
  void
  InstrInit(Instrumentation *instr, int instrument_options)
  {
--- 49,55 
  	return instr;
  }
  
! /* Initialize a pre-allocated instrumentation structure. */
  void
  InstrInit(Instrumentation *instr, int instrument_options)
  {
diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
new file mode 100644
index 30145b9..1c7e990
*** a/src/backend/libpq/pqformat.c
--- b/src/backend/libpq/pqformat.c
*** pq_beginmessage(StringInfo buf, char msg
*** 100,106 
  
   *		pq_beginmessage_reuse - initialize for sending a message, reuse buffer
   *
!  * This requires the buffer to be allocated in an sufficiently long-lived
   * memory context.
   * 
   */
--- 100,106 
  
   *		pq_beginmessage_reuse - initialize for sending a message, reuse buffer
   *
!  * This requires the buffer to be allocated in a sufficiently long-lived
   * 

Re: Incorrect use of "an" and "a" in code comments and docs

2018-03-29 Thread Bruce Momjian
On Mon, Mar  5, 2018 at 07:42:40PM +1300, Thomas Munro wrote:
> On Mon, Mar 5, 2018 at 6:51 PM, Abhijit Menon-Sen  
> wrote:
> > At 2018-03-05 14:42:14 +0900, mich...@paquier.xyz wrote:
> >> > - sinval is a signal invalidation, so it seems to me that "a" is
> >> > correct, not "an".
> >
> > I guess it depends on whether you read it as "sin-val" or "ess-inval".
> 
> $ git grep ' a SQL ' | wc -l
>  642
> $ git grep ' an SQL ' | wc -l
>  219
> 
> /me grabs popcorn

I am planning to look into this, and will convert to "an" in my proposed
patch.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


