Re: Question about commit 11cf92f6e2e13c0a6e3f98be3e629e6bd90b74d5

2019-03-05 Thread Etsuro Fujita

(2019/03/02 1:10), Robert Haas wrote:

On Fri, Mar 1, 2019 at 5:47 AM Etsuro Fujita
  wrote:

Robert, I CCed you because you are the author of that commit.  Before
that commit ("Rewrite the code that applies scan/join targets to
paths."), apply_scanjoin_target_to_paths() had a boolean parameter named
modify_in_place, and used apply_projection_to_path(), not
create_projection_path(), to adjust scan/join paths when modify_in_place
was true, which allowed us to save cycles at plan creation time by
avoiding creating projection paths, which I think would be a good thing,
but that commit removed that.  Why?


One of the goals of the commit was to properly account for the cost of
computing the target list.  Before parallel query and partition-wise
join, it didn't really matter what the cost of computing the target
list was, because every path was going to have to do the same work, so
it was just a constant factor getting added to every path.  However,
parallel query and partition-wise join mean that some paths can
compute the final target list more cheaply than others, and that turns
out to be important for things like PostGIS.  One of the complaints
that provoked that change was that PostGIS was picking non-parallel
plans even when a parallel plan was substantially superior, because it
wasn't accounting for the fact that in the parallel plan, the cost of
computing the target-list could be shared across all the workers
rather than paid entirely by the leader.

In order to accomplish this goal of properly accounting for the cost
of computing the target list, we need to create a new path, not just
jam the target list into an already-costed path.  Note that we did
some performance optimization around the same time to minimize the
performance hit here (see d7c19e62a8e0a634eb6b29f8fd944e57081f,
and I think there may have been something else as well although I
can't find it right now).


apply_projection_to_path() not only jams the given tlist into the
existing path but also updates its tlist eval costs appropriately, except in
the Gather and GatherMerge cases:


    /*
     * If the path happens to be a Gather or GatherMerge path, we'd like to
     * arrange for the subpath to return the required target list so that
     * workers can help project.  But if there is something that is not
     * parallel-safe in the target expressions, then we can't.
     */
    if ((IsA(path, GatherPath) || IsA(path, GatherMergePath)) &&
        is_parallel_safe(root, (Node *) target->exprs))
    {
        /*
         * We always use create_projection_path here, even if the subpath is
         * projection-capable, so as to avoid modifying the subpath in place.
         * It seems unlikely at present that there could be any other
         * references to the subpath, but better safe than sorry.
         *
-->      * Note that we don't change the parallel path's cost estimates; it
-->      * might be appropriate to do so, to reflect the fact that the bulk of
-->      * the target evaluation will happen in workers.
         */
        if (IsA(path, GatherPath))
        {
            GatherPath *gpath = (GatherPath *) path;

            gpath->subpath = (Path *)
                create_projection_path(root,
                                       gpath->subpath->parent,
                                       gpath->subpath,
                                       target);
        }
        else
        {
            GatherMergePath *gmpath = (GatherMergePath *) path;

            gmpath->subpath = (Path *)
                create_projection_path(root,
                                       gmpath->subpath->parent,
                                       gmpath->subpath,
                                       target);
        }
    }


The real reason for this question is: I noticed that projection paths
put on foreign paths make it hard for FDWs to detect whether there
is an already-well-enough-sorted remote path in the pathlist for the
final scan/join relation, used as an input relation to GetForeignUpperPaths()
for the UPPERREL_ORDERED step (the IsA(path, ForeignPath) test would not
work well enough to detect such remote paths!).  So I'm wondering whether we
could revive that parameter, as in the attached, to avoid the overhead at
plan creation time and to make the FDW's work easy.  Maybe I'm missing
something, though.


I think this would be a bad idea, for the reasons explained above.  I
also think that it's probably the wrong direction on principle.  I
think the way we account for target lists is still pretty crude and
needs to be thought of as more of a real planning step and less as
something that we can just ignore when it's inconvenient for some
reason.


I'm not sure I 100% agree with you, but I also think we need to give 
more thought to the tlist-eval-cost adjustment.



I think the FDW just needs to look through the projection
path and see what's underneath, but also take the projection path's
target list into account when decide wh
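
For illustration, here is a minimal sketch (not from any patch in this
thread) of how an FDW could look through a projection path to find the
underlying remote path, as suggested above.  The helper name is invented;
RelOptInfo, ProjectionPath and ForeignPath are existing planner types.

```c
#include "nodes/pathnodes.h"

static ForeignPath *
find_remote_path(RelOptInfo *rel)
{
    ListCell   *lc;

    foreach(lc, rel->pathlist)
    {
        Path       *path = (Path *) lfirst(lc);

        /* Look through a projection path to the path underneath it. */
        if (IsA(path, ProjectionPath))
            path = ((ProjectionPath *) path)->subpath;

        if (IsA(path, ForeignPath))
            return (ForeignPath *) path;
    }

    return NULL;
}
```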

Re: Re: [PATCH] pgbench tap tests fail if the path contains a perl special character

2019-03-05 Thread David Steele

On 2/8/19 3:30 PM, Raúl Marín wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Tested `pgbench-tap-glob-fix-1.patch` with a "+" in the path and the tap tests 
now pass.


Is this now ready for committer or is there anything else Fabien needs 
to do?


Marking the patch as either Ready for Committer or Waiting on Author 
will help move things along.


Regards,
--
-David
da...@pgmasters.net



Re: Problems with plan estimates in postgres_fdw

2019-03-05 Thread Etsuro Fujita

(2019/03/02 1:14), Antonin Houska wrote:

Etsuro Fujita  wrote:

(2019/03/01 20:00), Antonin Houska wrote:



Sorry, my explanation was not enough again, but I showed that query ("SELECT
a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;") to explain why
the following code bit is needed:

+   /*
+    * If this includes an UPPERREL_ORDERED step, the given target, which
+    * would be the final target to be applied to the resulting path, might
+    * have different expressions from the underlying relation's reltarget
+    * (see make_sort_input_target()); adjust tlist eval costs.
+    */
+   if (fpextra && fpextra->target != foreignrel->reltarget)
+   {
+       QualCost    oldcost = foreignrel->reltarget->cost;
+       QualCost    newcost = fpextra->target->cost;
+
+       startup_cost += newcost.startup - oldcost.startup;
+       total_cost += newcost.startup - oldcost.startup;
+       total_cost += (newcost.per_tuple - oldcost.per_tuple) * rows;
+   }


Maybe I understand now. Do the expressions (newcost.* - oldcost.*) reflect the
fact that, for the query

SELECT a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;

the UPPERREL_ORDERED stage only needs to evaluate the random() function
because (a + b) was already evaluated during the UPPERREL_GROUP_AGG stage?


Yeah, I think so.

Best regards,
Etsuro Fujita




Re: Re: NOT IN subquery optimization

2019-03-05 Thread David Steele

On 2/27/19 2:26 AM, David Rowley wrote:

On Wed, 27 Feb 2019 at 13:13, Tom Lane  wrote:


"Li, Zheng"  writes:

However, given that the March CommitFest is imminent and the runtime smarts 
patent concerns David had pointed out (which I was not aware of before), we 
would not move that direction at the moment.



I propose that we collaborate to build one patch from the two patches submitted 
in this thread for the CF.


TBH, I think it's very unlikely that any patch for this will be seriously
considered for commit in v12.  It would be against project policy, and
spending a lot of time reviewing the patch would be quite unfair to other
patches that have been in the queue longer.  Therefore, I'd suggest that
you not bend things out of shape just to have some patch to submit before
March 1.  It'd be better to work with the goal of having a coherent patch
ready for the first v13 CF, probably July-ish.


FWIW, I did add this to the March CF, but I set the target version to
13.  I wasn't considering this for PG12. I see Zheng was, but I agree
with you on PG13 being the target for this.


Looks like the target version of 13 was removed but I have added it back.

Regards,
--
-David
da...@pgmasters.net



Re: Re: [HACKERS] CLUSTER command progress monitor

2019-03-05 Thread David Steele

On 3/1/19 7:48 AM, Etsuro Fujita wrote:

(2019/03/01 14:17), Tatsuro Yamada wrote:

Attached patch is wip patch.


Is it possible to remove the following patch?
Because I registered the patch twice on CF Mar.

https://commitfest.postgresql.org/22/2049/


Please remove the above and keep this:

https://commitfest.postgresql.org/22/1779/

which I moved from the January CF to the March one on behalf of him.


I have closed the duplicate entry (#2049) and retained the entry which 
contains the CF history (#1779).


Regards,
--
-David
da...@pgmasters.net



Re: Re: SQL:2011 PERIODS vs Postgres Ranges?

2019-03-05 Thread David Steele

Hi Paul,

On 11/24/18 4:55 AM, Paul A Jungwirth wrote:

On Fri, Nov 23, 2018 at 3:41 PM Paul A Jungwirth
 wrote:

Here is a patch for my progress on this so far.


Well this is embarrassing, but my last patch used the mistaken syntax
`PRIMARY KEY (cols, WITHOUT OVERLAPS col)`. Here is a new patch which
uses the correct syntax `PRIMARY KEY (cols, col WITHOUT OVERLAPS)`.
Sorry about that! Also I went ahead and rebased it off current master.


I have marked this patch as targeting PG13 since it is clearly not 
material for PG12.  I also added you as the patch author.


Regards,
--
-David
da...@pgmasters.net



Re: Problem with default partition pruning

2019-03-05 Thread yuzuko
Imai-san,

Thanks for sharing your tests!

On Thu, Feb 28, 2019 at 5:27 PM Imai, Yoshikazu
 wrote:
>
> Hosoya-san
>
> On Wed, Feb 27, 2019 at 6:51 AM, Yuzuko Hosoya wrote:
> > > From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> > > Sent: Wednesday, February 27, 2019 11:22 AM
> > >
> > > Hosoya-san,
> > >
> > > On 2019/02/22 17:14, Yuzuko Hosoya wrote:
> > > > Hi,
> > > >
> > > > I found the bug of default partition pruning when executing a range
> > query.
> > > >
> > > > -
> > > > postgres=# create table test1(id int, val text) partition by range
> > > > (id); postgres=# create table test1_1 partition of test1 for values
> > > > from (0) to (100); postgres=# create table test1_2 partition of
> > > > test1 for values from (150) to (200); postgres=# create table
> > > > test1_def partition of test1 default;
> > > >
> > > > postgres=# explain select * from test1 where id > 0 and id < 30;
> > > >QUERY PLAN
> > > > 
> > > >  Append  (cost=0.00..11.83 rows=59 width=11)
> > > >->  Seq Scan on test1_1  (cost=0.00..5.00 rows=58 width=11)
> > > >  Filter: ((id > 0) AND (id < 30))
> > > >->  Seq Scan on test1_def  (cost=0.00..6.53 rows=1 width=12)
> > > >  Filter: ((id > 0) AND (id < 30))
> > > > (5 rows)
> > > >
> > > > There is no need to scan the default partition, but it's scanned.
> > > > -
> > > >
> > > > In the current implementation, whether the default partition is scanned
> > > > or not is determined according to each condition of the given WHERE
> > > > clause at get_matching_range_bounds().  In this example,
> > > > scan_default is set true according to id > 0 because id >= 200
> > > > matches the default partition.  Similarly, according to id < 30,
> > > > scan_default is set true.
> > > > Then, these results are combined according to AND/OR at
> > > > perform_pruning_combine_step().
> > > > In this case, the final result's scan_default is set true.
> > > >
> > > > The modifications I made are as follows:
> > > > - get_matching_range_bounds() determines only offsets of range bounds
> > > >   according to each condition
> > > > - These results are combined at perform_pruning_combine_step()
> > > > - Whether the default partition is scanned or not is determined at
> > > >   get_matching_partitions()
> > > >
> > > > Attached the patch.  Any feedback is greatly appreciated.
> > >
> > > Thank you for reporting.  Can you please add this to March CF in Bugs
> > > category so as not to lose
> > track
> > > of this?
> > >
> > > I will try to send review comments soon.
> > >
> > Thank you for your reply.  I added this to March CF.
>
> I tested with simple use case and I confirmed it works correctly like below.
>
> In case using between clause:
> postgres=# create table test1(id int, val text) partition by range (id);
> postgres=# create table test1_1 partition of test1 for values from (0) to (100);
> postgres=# create table test1_2 partition of test1 for values from (150) to (200);
> postgres=# create table test1_def partition of test1 default;
>
> [HEAD]
> postgres=# explain analyze select * from test1 where id between 0 and 50;
>                                                QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..58.16 rows=12 width=36) (actual time=0.008..0.008 rows=0 loops=1)
>    ->  Seq Scan on test1_1  (cost=0.00..29.05 rows=6 width=36) (actual time=0.005..0.005 rows=0 loops=1)
>          Filter: ((id >= 0) AND (id <= 50))
>    ->  Seq Scan on test1_def  (cost=0.00..29.05 rows=6 width=36) (actual time=0.002..0.002 rows=0 loops=1)
>          Filter: ((id >= 0) AND (id <= 50))
>
>
> [patched]
> postgres=# explain analyze select * from test1 where id between 0 and 50;
>                                                QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..29.08 rows=6 width=36) (actual time=0.006..0.006 rows=0 loops=1)
>    ->  Seq Scan on test1_1  (cost=0.00..29.05 rows=6 width=36) (actual time=0.004..0.005 rows=0 loops=1)
>          Filter: ((id >= 0) AND (id <= 50))
>
>
>
> I considered another use case. If the default partition contains rows whose
> id = 300 and we then add another partition which has a constraint like id >=
> 300 and id < 400, I thought we wouldn't scan those rows anymore. But I noticed
> we simply can't add such a partition.
>
> postgres=# insert into test1 values (300);
> INSERT 0 1
> postgres=# create table test1_3 partition of test1 for values from (300) to (400);
> ERROR:  updated partition constraint for default partition "test1_def" would be violated by some row
>
>
> So I haven't come up with bad cases so far :)

I didn't test the cases you mentioned.
Thanks to you, I could check the correctness of the patch!

-- 
Best regards,
Yuzuko Hosoya
NTT

Re: [HACKERS] CLUSTER command progress monitor

2019-03-05 Thread Tatsuro Yamada

Hi David,

On 2019/03/05 17:29, David Steele wrote:

On 3/1/19 7:48 AM, Etsuro Fujita wrote:

(2019/03/01 14:17), Tatsuro Yamada wrote:

Attached patch is wip patch.


Is it possible to remove the following patch?
Because I registered the patch twice on CF Mar.

https://commitfest.postgresql.org/22/2049/


Please remove the above and keep this:

https://commitfest.postgresql.org/22/1779/

which I moved from the January CF to the March one on behalf of him.


I have closed the duplicate entry (#2049) and retained the entry which contains 
the CF history (#1779).



Thank you! :)


Regards,
Tatsuro Yamada






Re: Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2019-03-05 Thread David Steele

Hi Peter,

On 2/28/19 11:43 PM, Peter Moser wrote:

Dear all,
we rebased our temporal normalization patch on top of 
554ebf687852d045f0418d3242b978b49f160f44 from 2019-02-28.


I will also add this prototype (WIP) patch to the March commitfest,
as suggested by two developers we met at FOSDEM some weeks ago.


I have marked this entry as targeting PG13 since it is too late to 
consider for PG12.


Regards,
--
-David
da...@pgmasters.net



Re: Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Tue, 5 Mar 2019 at 21:21, David Steele  wrote:
>
> On 2/27/19 2:26 AM, David Rowley wrote:
> > FWIW, I did add this to the March CF, but I set the target version to
> > 13.  I wasn't considering this for PG12. I see Zheng was, but I agree
> > with you on PG13 being the target for this.
>
> Looks like the target version of 13 was removed but I have added it back.

The problem seems to be that there are now 2 CF entries for this
thread. I originally added [1], but later Zheng added [2].  From what
Jim mentioned when he opened this thread I had the idea that no patch
existed yet, so I posted the one I already had written for this 4
years ago thinking that might be useful to base new work on.  I guess
Zheng's patch already existed when Jim opened this thread, as a patch
appeared fairly quickly afterwards.  If that's true then I understand
that they wouldn't want to drop the work they'd already done in favour
of picking mine up.

I'm not all that sure what to do about this. It's going to cause quite
a bit of confusion having two patches in one thread. Part of me feels
that I've hijacked this thread and that I should just drop my patch
altogether and help review Zheng's patch, but I'm struggling a bit to
do that as I've not managed to find problems with my version, but a
few have been pointed out with the other patch  (of course, there may
be some yet undiscovered issues with my version too).

Alternatively, I could take my patch to another thread, but there does
not seem to be much sense in that. It might not solve the confusion
problem.  The best thing would be that if we could work together to
make this work, however, we both seem to have fairly different ideas
on how it should work. Tom and I both agree that Zheng and Jim's
proposal to add OR x IS NULL clauses to the join condition is most
likely a no go area due to it disallowing hash and merge anti-joins.
The last I can understand from Jim is that they appear to disagree
with that and want to do the transformation based on costs.  Perhaps
they're working on some new ideas to make that more feasible. I'm
interested to hear the latest on this.

[1] https://commitfest.postgresql.org/22/2020/
[2] https://commitfest.postgresql.org/22/2023/

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



RE: libpq debug log

2019-03-05 Thread Iwata, Aya
Hi Horiguchi-san,

Thank you for your reply and suggestions.

> > 1. Expand PQtrace() facility and improve libpq logging.
> >
> > 2. Prepare "output level". There are 3 type of levels;
> > - TRADITIONAL :  if set, outputs protocol messages
> > - LEVEL1      :  if set, outputs phase and time
> > - LEVEL2      :  if set, outputs both TRADITIONAL and LEVEL1 info
> 
> You appear to want to segregate the "traditional" output from what you want
> to emit. At least Tom is explicitly suggesting to throw away the hypothetical
> use cases for it. You may sort out what kind of information you/we want to
> emit as log messages from scratch:p
> 
> You may organize log kinds into hierarchical levels like server log messages
> or into orthogonal types that are individually turned on. But it is not
> necessarily be a parameter of a logging processor. (mentioned below)

It is intended for users of old applications that use PQtrace() and expect the
existing/default/traditional log output style.
That's why I separated the other information (e.g. phase and timestamp).
But since you mentioned the levels are not necessary,
I will follow your advice and include that information in the reformatted
PQtrace().


> > 3. Add "output phase" information; This is the processing flow. (ex.
> > When PQexec() start and end )
> 
> What is the advantage of it against just two independent messages like
> PQexec_start and PQexec_end? (I don't see any advantage.)

I think the purpose of this logging improvement is to make it useful for
analysis when performance deteriorates.
When a query delay happens, we want to know which side (server or client) is
the cause of it, and then which processing takes the time.
I think the phase and time information are useful for that diagnosis;
for example, when a command processing function (e.g. PQexec()) starts/ends
and when the client receives/sends protocol messages.
/*my intended output here */


> > 4. Change output method to callback style; Default is stdout, and prepare
> other callback functions that will be used more frequently.
> 
> Are you going to make something less-used the default behavior? I think no
> one is opposing rich functionality as far as it is replaceable.
I am sorry, my explanation was not clear.
I just wanted to say that I intend to provide several output method functions which
users are likely to need,
e.g. output to stdout, to a file, or to a log directory.

 
> > 5. Initialization method;
> > In current one: PQtrace(PGconn *conn, FILE *stream); Proposed change:
> > PQtraceEx(PGconn *conn, FILE *stream, PQloggingProcessor callback_func
> > , void *callback_arg, PGLogminlevel level);
> 
> I'm not sure we should add a new *EX() function. Couldn't we just change the
> signature of PQtrace()?
I intended to add a new *EX() function for compatibility purposes, especially for
old versions of libpq.
I would like to avoid requiring changes to old applications for this.
But if we insist on changing PQtrace() itself, then I will follow your
advice.

> 
> callback_funs seems to be a single function. I think it's better to have
> individual function for each message type.  Not
> callback_func(PQLOG_EXEC_START, param_1, param_2,...)  ,but
> PQloggingProcessor.PQexec_start(param_1, param_2, ...).
> 
> It is because the caller can simply pass values in its own type to the 
> function
> without casting or other transformations and their types are checked
> statically.
> 
> I also think it's better that logger-specific parameters are not passed at
> this level.  Maybe stream and level are logger-specific parameters, which can
> be combined into callback_arg, or can be given via an API function.

Thank you for your advice. I will consider it.
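
For concreteness, here is a rough sketch of the struct-of-callbacks style
described above.  This is purely illustrative: nothing like this exists in
libpq today, and every type and member name below is invented; only PGresult
is a real libpq type.

```c
#include "libpq-fe.h"

/*
 * Hypothetical per-event callback struct: one typed function per event,
 * so callers pass values in their own types and get static type checking.
 */
typedef struct PQloggingProcessor
{
    void       *arg;            /* logger-specific state (stream, level, ...) */

    void        (*exec_start) (void *arg, const char *query);
    void        (*exec_end) (void *arg, const PGresult *res);
    void        (*message_sent) (void *arg, char msg_type, int length);
    void        (*message_received) (void *arg, char msg_type, int length);
} PQloggingProcessor;
```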

Regards,
Aya Iwata




Re: [HACKERS] CLUSTER command progress monitor

2019-03-05 Thread Tatsuro Yamada

Hi Robert!

On 2019/03/05 11:35, Robert Haas wrote:

On Mon, Mar 4, 2019 at 5:38 AM Tatsuro Yamada
 wrote:

=== Current design ===

CLUSTER command uses Index Scan or Seq Scan when scanning the heap.
Depending on which one is chosen, the command will proceed in the
following sequence of phases:

* Scan method: Seq Scan
  0. initializing (*2)
  1. seq scanning heap(*1)
  3. sorting tuples   (*2)
  4. writing new heap (*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

* Scan method: Index Scan
  0. initializing (*2)
  2. index scanning heap  (*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

VACUUM FULL command will proceed in the following sequence of phases:

  1. seq scanning heap(*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

(*1): increasing the value in heap_tuples_scanned column
(*2): only shows the phase in the phase column


All of that sounds good.


The view provides CLUSTER command progress details as follows:
# \d pg_stat_progress_cluster
           View "pg_catalog.pg_stat_progress_cluster"
        Column         |  Type   | Collation | Nullable | Default
-----------------------+---------+-----------+----------+---------
 pid                   | integer |           |          |
 datid                 | oid     |           |          |
 datname               | name    |           |          |
 relid                 | oid     |           |          |
 command               | text    |           |          |
 phase                 | text    |           |          |
 cluster_index_relid   | bigint  |           |          |
 heap_tuples_scanned   | bigint  |           |          |
 heap_tuples_vacuumed  | bigint  |           |          |


Still not sure if we need heap_tuples_vacuumed.  We could try to
report heap_blks_scanned and heap_blks_total like we do for VACUUM, if
we're using a Seq Scan.


I have no strong opinion about adding heap_tuples_vacuumed, so I'll remove it in
the next patch.

Regarding heap_blks_scanned and heap_blks_total, I suppose it is possible to
get those from initscan(). I'll investigate it more.

cluster.c
  copy_heap_data()
heap_beginscan()
  heap_beginscan_internal()
initscan()




=== Discussion points ===

   - Progress counter for "3. sorting tuples" phase
  - Should we add pgstat_progress_update_param() in tuplesort.c like a
"trace_sort"?
Thanks to Peter Geoghegan for the useful advice!


How would we avoid an abstraction violation?


Hmm... What do you mean by an abstraction violation?
If it is difficult to solve, I'd rather not add the progress counter for the
tuple-sorting phase.



   - Progress counter for "6. rebuilding index" phase
  - Should we add "index_vacuum_count" in the view like a vacuum progress 
monitor?
If yes, I'll add pgstat_progress_update_param() to reindex_relation() 
of index.c.
However, I'm not sure whether it is okay or not.


Doesn't seem unreasonable to me.


I see, I'll add it later.
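
A rough sketch of what that reporting could look like follows.  The
progress-slot constant is invented here (a real patch would add it to
progress.h); pgstat_progress_update_param() itself already exists.

```c
#include "postgres.h"
#include "pgstat.h"

/* Hypothetical progress-parameter slot for the CLUSTER progress view */
#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT    6

/*
 * Illustrative helper: bump the "indexes rebuilt so far" counter, much as
 * the VACUUM progress monitor reports index_vacuum_count.
 */
static void
report_index_rebuilt(int64 indexes_done)
{
    pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
                                 indexes_done);
}
```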


Regards,
Tatsuro Yamada







Re: NOT IN subquery optimization

2019-03-05 Thread David Steele

On 3/5/19 10:53 AM, David Rowley wrote:

On Tue, 5 Mar 2019 at 21:21, David Steele  wrote:


On 2/27/19 2:26 AM, David Rowley wrote:

FWIW, I did add this to the March CF, but I set the target version to
13.  I wasn't considering this for PG12. I see Zheng was, but I agree
with you on PG13 being the target for this.


Looks like the target version of 13 was removed but I have added it back.


The problem seems to be that there are now 2 CF entries for this
thread. I originally added [1], but later Zheng added [2].  From what
Jim mentioned when he opened this thread I had the idea that no patch
existed yet, so I posted the one I already had written for this 4
years ago thinking that might be useful to base new work on.


Yeah, I just figured this out when I got to your patch which was 
properly marked as PG13 and then saw they were pointing at the same thread.


At the very least one of the patch entries should be closed, or moved to 
a new thread.


I'm not sure if I have an issue with competing patches on the same 
thread.  I've seen that before and it can lead to a good outcome.  It 
can, as you say, also lead to confusion.


Regards,
--
-David
da...@pgmasters.net



Re: Problem with default partition pruning

2019-03-05 Thread yuzuko
Hi Ibrar,

On Tue, Mar 5, 2019 at 2:37 AM Ibrar Ahmed  wrote:
>
> Hi  Yuzuko Hosoya,
>
> Ignore my last message, I think this is also a legitimate scan on default 
> partition.
>
Oh, I got it.  Thanks a lot.

>
> On Mon, Mar 4, 2019 at 10:29 PM Ibrar Ahmed  wrote:
>>
>> Hi
>>
>> Patch works fine for me, but I have one test case where the default partition
>> is still scanned.
>>
>> postgres=# explain select * from test1 where (id < 10) and true;
>> QUERY PLAN
>> ---
>>  Append  (cost=0.00..55.98 rows=846 width=36)
>>->  Seq Scan on test1_1  (cost=0.00..25.88 rows=423 width=36)
>>  Filter: (id < 10)
>>->  Seq Scan on test1_def  (cost=0.00..25.88 rows=423 width=36)
>>  Filter: (id < 10)
>> (5 rows)
>
>
>
> --
> Ibrar Ahmed

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center



Re: [PATCH 0/3] Introduce spgist quadtree @<(point,circle) operator

2019-03-05 Thread Matwey V. Kornilov
Sat, 2 Mar 2019 at 12:14, Alexander Korotkov :
>
> Hi!

Thanks for reply. Comments and questions are below.

>
> On Fri, Feb 1, 2019 at 7:08 PM Matwey V. Kornilov
>  wrote:
> > This patch series is to add support for spgist quadtree @<(point,circle)
> > operator. The first two patches are to refactor existing code before
> > implemention the new feature. The third commit is the actual implementation
> > provided with a set of simple unit tests.
>
> Cool!
>
> > Matwey V. Kornilov (3):
> >   Introduce helper variable in spgquadtreeproc.c
> >   Introduce spg_quad_inner_consistent_box_helper() in spgquadtreeproc.c
> >   Add initial support for spgist quadtree @<(point,circle) operator
>
> At first, I have to note that it's not necessary to post every patch
> in separate message.  It would be both easier and comfortable for
> readers if you just put your patches as multiple attachments to the
> same email message.

Ok, sorry, git send-email posts each patch as a separate message by default.
Would it be even easier for you next time to receive a GitHub repo URL
and branch to pull from?

>
> Regarding the patchset itself
>  * spg_quad_inner_consistent_circle_helper() definitely needs comments.

Ok

>  * In PostgreSQL we require that index scan produce exactly same
> results as sequence scan.  Can we ensure this is so for
> @<(point,circle) operator even in corner cases of rounding error?

I am not sure that I fully understand the issue here.
The operator is registered with amopfamily =
'spgist/quad_point_ops'; how could we use it without the index to do a
sequential scan?

>  * In our coding style we have function name is the separate line from
> its return type.

Ok
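
(For reference, the convention being referred to looks like this; a trivial
example, not from the patch:)

```c
/* Return type on its own line; the function name starts at column 1. */
static double
distance_squared(double dx, double dy)
{
    return dx * dx + dy * dy;
}
```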

>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company



-- 
With best regards,
Matwey V. Kornilov



Re: extension patch of CREATE OR REPLACE TRIGGER

2019-03-05 Thread David Steele

On 2/28/19 10:43 AM, Osumi, Takamichi wrote:


One past thread about introducing CREATE OR REPLACE TRIGGER into the syntax
had stopped without a complete discussion of the LOCK level.

The past thread is this. I'd like to inherit this one.


Since this patch landed at the last moment in the last commitfest for  
PG12 I have marked it as targeting PG13.


Regards,
--
-David
da...@pgmasters.net



Re: Problems with plan estimates in postgres_fdw

2019-03-05 Thread Etsuro Fujita

(2019/03/04 16:46), Antonin Houska wrote:

Etsuro Fujita  wrote:

(2019/03/01 20:00), Antonin Houska wrote:



It's still unclear to me why add_foreign_ordered_paths() passes the input
relation (input_rel) to estimate_path_cost_size(). If it passed the output rel
(i.e. ordered_rel in this case) like add_foreign_grouping_paths() does, then
foreignrel->reltarget should contain the random() function. (What I see now is
that create_ordered_paths() leaves ordered_rel->reltarget empty, but that's
another problem to be fixed.)

Do you still see a reason to call estimate_path_cost_size() this way?


Yeah, the reason for that is that we can estimate the costs of sorting for
the final scan/join relation using, as-is, the existing code that estimates the
costs of sorting for base or join relations, except for the tlist eval cost
adjustment.  As you mentioned, we could pass ordered_rel to
estimate_path_cost_size(), but if so, I think we would need to get input_rel
(ie, the final scan/join relation) from ordered_rel within
estimate_path_cost_size() to use that code, which would not be great.


After some more thought I suggest that estimate_path_cost_size() is redesigned
before your patch gets applied. The function is quite hard to read if - with
your patch applied - the "foreignrel" argument sometimes means the output
relation and other times the input one. It takes some effort to "switch
context" between these two perspectives when I read the code.


I have to admit that my proposal makes estimate_path_cost_size() 
complicated to a certain extent, but I don't think it changes the 
meaning of the foreignrel; even in my proposal, the foreignrel should be 
considered as an output relation rather than an input relation, because 
we create a foreign upper path with the foreignrel being the parent 
RelOptInfo for the foreign upper path in add_foreign_ordered_paths() or 
add_foreign_final_paths().



Perhaps both input and output relation should be passed to the function now,
and maybe also UpperRelationKind of the output relation.
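
Something along those lines might look like this; a purely hypothetical
signature sketch, not taken from any patch, with parameter names invented
for illustration:

```c
/*
 * Hypothetical redesign sketch: pass the input (scan/join) relation and the
 * output (upper) relation separately, plus the upper relation kind, instead
 * of overloading a single "foreignrel" argument.
 */
static void
estimate_path_cost_size(PlannerInfo *root,
                        RelOptInfo *input_rel,
                        RelOptInfo *output_rel,
                        UpperRelationKind stage,
                        List *pathkeys,
                        double *p_rows, int *p_width,
                        Cost *p_startup_cost, Cost *p_total_cost);
```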


My concern is: that would make it inconvenient to call that function.


And maybe it's even
worth moving some code into a subroutine that handles only the upper relation.

Originally the function could only handle base relations, then join relation
was added, then one kind of upper relation (UPPERREL_GROUP_AGG) was added and
eventually the function will need to handle multiple kinds of upper
relation. It should be no surprise if such an amount of changes makes the
original signature insufficient.


I 100% agree with you on that point.

Best regards,
Etsuro Fujita




Re: New vacuum option to do only freezing

2019-03-05 Thread Masahiko Sawada
On Tue, Mar 5, 2019 at 8:27 AM Bossart, Nathan  wrote:
>
> On 2/28/19, 12:13 AM, "Masahiko Sawada"  wrote:
> > Attached the updated version patch. I've incorporated all review
> > comments I got and have changed the number of tuples being reported as
> > 'removed tuples'. With this option, tuples completely being removed is
> > only tuples marked as unused during HOT-pruning, other dead tuples are
> > left. So we count those tuples during HOT-pruning and reports it as
> > removed tuples.
>
> Thanks for the new patches.  Beyond the reloptions discussion, most of
> my feedback is wording suggestions.

Thank you for the comment!

>
> +  VACUUM removes dead tuples and prunes HOT-updated
> +  tuples chain for live tuples on table. If the table has any dead tuple
> +  it removes them from both table and indexes for re-use. With this
> +  option VACUUM doesn't completely remove dead tuples
> +  and disables removing dead tuples from indexes.  This is suitable for
> +  avoiding transaction ID wraparound (see
> +  ) but not sufficient for 
> avoiding
> +  index bloat. This option is ignored if the table doesn't have index.
> +  This cannot be used in conjunction with FULL
> +  option.
>
> There are a couple of small changes I would make.  How does something
> like this sound?
>
> VACUUM removes dead tuples and prunes HOT-updated tuple chains for
> live tuples on the table.  If the table has any dead tuples, it
> removes them from both the table and its indexes and marks the
> corresponding line pointers as available for re-use.  With this
> option, VACUUM still removes dead tuples from the table, but it
> does not process any indexes, and the line pointers are marked as
> dead instead of available for re-use.  This is suitable for
> avoiding transaction ID wraparound (see Section 24.1.5) but not
> sufficient for avoiding index bloat.  This option is ignored if
> the table does not have any indexes.  This cannot be used in
> conjunction with the FULL option.

Hmm, that's a good idea, but I'm not sure that users know the term 'line
pointer' because it is used only in the pageinspect documentation. I wonder if
the term 'item identifier' would be more appropriate here, because that is
used when describing the page layout (in storage.sgml). Thoughts?

>
> - * Returns the number of tuples deleted from the page and sets
> - * latestRemovedXid.
> + * Returns the number of tuples deleted from the page and set latestRemoveXid
> + * and increment nunused.
>
> I would say something like: "Returns the number of tuples deleted from
> the page, sets latestRemovedXid, and updates nunused."

Fixed.

>
> +   /*
> +* hasindex = true means two-pass strategy; false means one-pass. But 
> we
> +* always use the one-pass strategy when index vacuum is disabled.
> +*/
>
> I think the added sentence should make it more clear that hasindex
> will still be true when DISABLE_INDEX_CLEANUP is used.  Maybe
> something like:
>
> /*
>  * hasindex = true means two-pass strategy; false means one-pass
>  *
>  * If DISABLE_INDEX_CLEANUP is used, hasindex may still be true,
>  * but we'll always use the one-pass strategy.
>  */
>

Fixed.

> tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, 
> false,
> - 
>&vacrelstats->latestRemovedXid);
> + 
>&vacrelstats->latestRemovedXid,
> + 
>&tups_pruned);
>
> Why do we need a separate tups_pruned argument in heap_page_prune()?
> Could we add the result of heap_page_prune() to tups_pruned instead,
> then report the total number of removed tuples as tups_vacuumed +
> tups_pruned elsewhere?

Hmm, I thought that we should report only the number of tuples
completely removed, but we already count the tuples marked as
redirected in tups_vacuumed. Let me summarize the fate of dead tuples.
I think we can roughly classify dead tuples as follows.

1. root tuple of HOT chain that became dead
2. root tuple of HOT chain that became redirected
3. other tuples of HOT chain that became unused
4. tuples that became dead after HOT pruning

The tuples of #1 through #3 either have only ItemIds or have been
completely removed, but the tuples of #4 still have their tuple storage
because they are not processed during HOT pruning.

Currently tups_vacuumed counts all of them, while nleft (=
vacrelstats->num_dead_tuples) counts #1 + #4. I think the number of
removed tuples to be reported should be #1 + #2 + #3. Or should we
use #2 + #3 instead?

>
> +* If there are no indexes or we skip index vacuum then we 
> can vacuum
> +* the page right now instead of doing a second scan.
>
> How about:
>
> If there are no indexes or index cleanup is disab

Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Sun, 3 Mar 2019 at 01:34, Jim Finnerty  wrote:
> I agree with Tom's comment above - when the cost of the NOT IN is dominated
> by the cost of the outer scan (i.e. when the cardinality of the outer
> relation is large relative to the cardinality of the subquery), and if the
> inner cardinality is small enough to fit in memory, then the current
> implementation does an efficient hash lookup into an in-memory structure,
> and that's a very fast way to do the NOT IN.  It almost achieves the
> lower-bound cost of scanning the outer relation.  It can also parallelize
> easily, whether or not we currently can do that.  In these cases, the
> current plan is the preferred plan, and we should keep it.

If you do the conversion to anti-join (without hacking at the join
quals and assuming no nulls are possible), then the planner can decide
what's best.  The planner may choose a hash join which is not hugely
different from a hashed subplan, however from the testing I've done
the Hash Join is a bit faster. I imagine there's been more motivation
over the years to optimise that over hashed subplans.  As far as I
know, there's no parallel query support for hashed subplans, but I
know there is for hash joins.  In short, I don't think it makes any
sense to not translate into an anti-join (when possible). I think the
best anti-join plan will always be a win over the subquery. The
planner could make a mistake of course, but that's a different issue.
We certainly don't consider keeping the subquery around for NOT
EXISTS.

> This is a case that we would want to avoid the transform, because when both
> the inner and outer are nullable and the outer is large and the inner is
> small, the transformed plan would Scan and Materialize the inner for each
> row of the outer, which is very slow compared to the untransformed plan:
>
> slow case for the transformation: https://explain.depesz.com/s/0CBB

Well, that's because you're forcing the planner into a corner in
regards to the join condition. It has no choice but to nested loop
that join.

> However, if the inner is too large to fit into memory, then the transformed
> plan is faster on all of our other test cases, although our test cases are
> far from complete.  If the current solution supports parallel scan of the
> outer, for example, then PQ could have lower elapsed time than the non-PQ
> nested loop solution.

I'm having a little trouble understanding this.  From what I
understand the code adds an OR .. IS NULL clause to the join
condition. Is this still the case with what you've been testing here?
If so, I'm surprised to hear all your test cases are faster. If
there's an OR clause in the join condition then the planner has no
choice but to use a nested loop join, so it's very surprising that you
would find that faster with larger data sets.

Or does the code your testing implement this a different way? Perhaps
with some execution level support?

> Also, remember that the issue with the empty inner was just a bug that was
> the result of trying to do an additional optimization in the case where
> there is no WHERE clause in the subquery.  That bug has been fixed.  The
> general case transformation described in the base note produces the correct
> result in all cases, including the empty subquery case.

I'm not sure why lack of WHERE clause in the subquery counts for
anything here.  The results set from the subquery can be empty or not
empty with or without a WHERE clause.  The only way you'll know it's
empty during planning is if some gating qual says so, but that's yet
to be determined by the time the transformation should be done.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Problems with plan estimates in postgres_fdw

2019-03-05 Thread Etsuro Fujita

(2019/03/01 20:16), Antonin Houska wrote:

Etsuro Fujita  wrote:



Conversely, it appears that add_foreign_ordered_paths() added by the patchset
would generate such pre-sorted paths *redundantly* when the input_rel is the
final scan/join relation.  Will look into that.


Currently I have no idea how to check the plan created by FDW at the
UPPERREL_ORDERED stage, except for removing the sort from the
UPPERREL_GROUP_AGG stage as I proposed here:

https://www.postgresql.org/message-id/11807.1549564431%40localhost


I don't understand your words "how to check the plan created by FDW". 
Could you elaborate on that a bit more?


Best regards,
Etsuro Fujita




Re: Minimal logical decoding on standbys

2019-03-05 Thread tushar

On 03/04/2019 10:57 PM, Andres Freund wrote:

Note that hot_standby_feedback=on needs to be set on a standby, not on
the primary (although it doesn't do any harm there).


Right. This parameter was enabled on both the master and the slave.

Is someone able to reproduce this issue?

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-05 Thread Masahiko Sawada
On Tue, Mar 5, 2019 at 3:46 AM Tomas Vondra
 wrote:
>
>
>
> On 3/4/19 6:40 AM, Masahiko Sawada wrote:
> > On Sat, Mar 2, 2019 at 5:27 AM Robert Haas  wrote:
> >>
> >> On Thu, Feb 7, 2019 at 3:28 AM Masahiko Sawada  
> >> wrote:
> >>> WAL encryption will follow as an additional feature.
> >>
> >> I don't think WAL encryption is an optional feature.  You can argue
> >> about whether it's useful to encrypt the disk files in the first place
> >> given that there's no privilege boundary between the OS user and the
> >> database, but a lot of people seem to think it is and maybe they're
> >> right.  However, who can justify encrypting only SOME of the disk
> >> files and not others?  I mean, maybe you could justify not encrypting
> >> the SLRU files on the grounds that they probably don't leak much in
> >> the way of interesting information, but the WAL files certainly do --
> >> your data is there, just as much as in the data files themselves.
> >>
> >
> > Agreed.
> >
> >> To be honest, I think there is a lot to like about the patches
> >> Cybertec has proposed.  Those patches don't have all of the fancy
> >> key-management stuff that you are proposing here, but maybe that
> >> stuff, if we want it, could be added, rather than starting over from
> >> scratch.  It seems to me that those patches get a lot of things right.
> >> In particular, it looked to me when I looked at them like they made a
> >> pretty determined effort to encrypt every byte that might go down to
> >> the disk.  It seems to me that you if you want encryption, you want
> >> that.
> >>
> >
> > Agreed. I think the patch lacks the key management stuff: 2-tier key
> > architecture and integration of postgres with key management systems.
> > I'd like to work together and can propose the patch of key management
> > stuff to the proposed patch.
> >
>
> Sounds like a plan. It'd be nice to come up with a unified version of
> those two patches, combining the good pieces from both.
>
> I wonder how other databases deal with key management? Surely we're not
> the first/only database that tries to do transparent encryption, so
> perhaps we could learn something from others? For example, do they use
> this 2-tier key architecture? How do they do key management? etc.
>
> I don't say we should copy from them, but it'd allow us to (a) avoid
> making the same mistakes and (b) build a solution the users are already
> somewhat familiar with.
>
> May I suggest creating a page on the PostgreSQL wiki, explaining the
> design and updating it as the discussion develops?

Understood. I've been researching transparent encryption of other
databases and been considering the architecture. I'll write down them
to the wiki.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Refactoring the checkpointer's fsync request queue

2019-03-05 Thread Thomas Munro
On Tue, Mar 5, 2019 at 2:25 PM Shawn Debnath  wrote:
> [v11 patch]

Thanks.  Hmm, something is wrong here because make check is
dramatically slower -- for example the "insert" test runs in ~8-13
seconds instead of the usual ~0.2 seconds according to Travis,
AppVeyor and my local FreeBSD system (note that fsync is disabled so
it's not that -- it must be bogus queue-related CPU?)

-- 
Thomas Munro
https://enterprisedb.com



RE: libpq debug log

2019-03-05 Thread Iwata, Aya
Hi everyone,

I appreciate all the helpful advice.

I agree that messages should be displayed more clearly. I will follow your advice.

I would like to add a timestamp per line and log when command processing functions
start/end.
I think it is useful for diagnosis to know when the application's processing starts/ends.
So I will implement it like this:

2019-03-03 07:24:54.142 UTC   PQgetResult start
2019-03-03 07:24:54.143 UTC   < 'T' 35 1 "set_config" 0 #0 25 #65535 -1 #0
2019-03-03 07:24:54.144 UTC   PQgetResult end


> > But I still don't really see a need for different levels or whatever.
> > I mean, you either want a dump of all of the protocol traffic, or you
> > don't, I think.  Or maybe I am confused as to what the goal of all
> > this really is.
> 
> Yeah, me too.  But a lot of this detail would only be useful if you were 
> trying
> to diagnose something like a discrepancy between the server and libpq as to
> the width of some field.  And the number of users for that can be counted
> without running out of fingers.  I think what would be of use for a trace
> facility is as high-level a view as possible of the message contents.
> 
> Or, in other words: a large part of the problem with the existing PQtrace
> facility is that it *was* designed to help debug libpq itself, and that
> use-case is no longer very interesting.  We should assume that the library
> knows how to parse protocol messages.

Since I explained the reason in the previous email, I am copy-pasting it again 
here.

I think the purpose of the leveling is to provide optional information for the
user, which is useful for diagnosis when performance deteriorates.
When a query delay happens, we want to know which side (server or client) is
the cause of it, and then which processing takes the time.
I think the phase and time information are useful for that diagnosis;
for example, when a command processing function (e.g. PQexec()) starts/ends
and when the client receives/sends protocol messages.

So is it alright to add this information to the new/proposed PQtrace() default
output?



Regards,
Aya Iwata




[Issue] Can't recompile cube extension as PGXS, utils/float.h is not installed

2019-03-05 Thread Siarhei Siniak
1. Currently, cube extension has CUBE_MAX_DIM set as 100.
A recent github issue. [1]
2. To compile a custom version of the extension off the tree requires:
```
   make -C contrib/custom_cube USE_PGXS=1
```
3. But utils/float.h, required by cube.c and cubeparse.y, is not installed.
It's not present in the latest release file [2],
nor is it installed by running
make install when compiling from git.
4. Current workaround is to use
```
#include "../../src/include/utils/float.h"
```
in cube.c and cubeparse.y when compiling in git tree.

[1] https://github.com/postgres/postgres/pull/38
[2] https://github.com/postgres/postgres/archive/REL_11_2.tar.gz


Re: Optimization of some jsonb functions

2019-03-05 Thread David Steele

On 2/22/19 2:05 AM, Nikita Glukhov wrote:

Attached set of patches with some jsonb optimizations that were made during
comparison of performance of ordinal jsonb operators and jsonpath operators.


This patch was submitted just before the last commitfest for PG12 and 
seems to have potential for breakage.


I have updated the target to PG13.

Regards,
--
-David
da...@pgmasters.net



Re: speeding up planning with partitions

2019-03-05 Thread Amit Langote
On 2019/03/05 9:50, Amit Langote wrote:
> I'll post the updated patches after diagnosing what
> I suspect is a memory over-allocation bug in one of the patches.  If you
> configure build with --enable-cassert, you'll see that throughput
> decreases as number of partitions run into many thousands, but it doesn't
> when asserts are turned off.

Attached an updated version.  This incorporates fixes for both Jesper's
and Imai-san's review.  I haven't been able to pin down the bug (or
whatever) that makes throughput go down as the partition count increases,
when tested with a --enable-cassert build.

Thanks,
Amit
From 1e793976d1affae8aa6fa1c835752bad530132b6 Mon Sep 17 00:00:00 2001
From: Amit 
Date: Sat, 2 Mar 2019 14:13:13 +0900
Subject: [PATCH v27 1/6] Build "other rels" of appendrel baserels in a
 separate step

Currently they're built in a stanza in build_simple_rel() which
builds the child rels in the same invocation of build_simple_rel
that's used to build the parent relation.  Since the query hasn't
been processed to distribute restriction clauses to individual
baserels at this point, we cannot perform partition pruning before
adding the children.

Newly added add_other_rels_to_query() runs *after* query_planner has
distributed restriction clauses to base relations.  This will allow
us to use the clauses applied to a given partitioned baserel to perform
partition pruning, and add other rels for only the unpruned
partitions.  Later patches will do that.
---
 src/backend/optimizer/path/allpaths.c  |   2 +-
 src/backend/optimizer/plan/initsplan.c |  60 ++--
 src/backend/optimizer/plan/planmain.c  |  15 +++--
 src/backend/optimizer/util/relnode.c   | 101 +
 src/include/optimizer/pathnode.h   |   2 +
 src/include/optimizer/planmain.h   |   1 +
 6 files changed, 135 insertions(+), 46 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index 0debac75c6..8d8a8f17d5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1028,7 +1028,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 
/*
 * The child rel's RelOptInfo was already created during
-* add_base_rels_to_query.
+* add_other_rels_to_query.
 */
childrel = find_base_rel(root, childRTindex);
Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
diff --git a/src/backend/optimizer/plan/initsplan.c 
b/src/backend/optimizer/plan/initsplan.c
index 2afc3f1dfe..077d3203ba 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -20,6 +20,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
+#include "optimizer/inherit.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
@@ -30,6 +31,7 @@
 #include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/analyze.h"
+#include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
 
@@ -97,10 +99,11 @@ static void check_hashjoinable(RestrictInfo *restrictinfo);
  * jtnode.  Internally, the function recurses through the jointree.
  *
  * At the end of this process, there should be one baserel RelOptInfo for
- * every non-join RTE that is used in the query.  Therefore, this routine
- * is the only place that should call build_simple_rel with reloptkind
- * RELOPT_BASEREL.  (Note: build_simple_rel recurses internally to build
- * "other rel" RelOptInfos for the members of any appendrels we find here.)
+ * every non-join RTE that is specified in the query.  Therefore, this
+ * routine is the only place that should call build_simple_rel with
+ * reloptkind RELOPT_BASEREL.  (Note:  "other rel" RelOptInfos for the
+ * members of any appendrels we find here are built later when query_planner
+ * calls add_other_rels_to_query().)
  */
 void
 add_base_rels_to_query(PlannerInfo *root, Node *jtnode)
@@ -133,6 +136,55 @@ add_base_rels_to_query(PlannerInfo *root, Node *jtnode)
 (int) nodeTag(jtnode));
 }
 
+/*
+ * add_other_rels_to_query
+ *
+ *   Scan the query's jointree and for each base rels that is an appendrel,
+ *   create otherrel RelOptInfos of its children
+ *
+ * At the end of this process, there should be RelOptInfos for all relations
+ * that will be scanned by the query.
+ */
+void
+add_other_rels_to_query(PlannerInfo *root, Node *jtnode)
+{
+   if (jtnode == NULL)
+   return;
+   if (IsA(jtnode, RangeTblRef))
+   {
+   int varno = ((RangeTblRef *) 
jtnode)->rtindex;
+   RangeTblEntry *rte = rt_fetch(varno, root->parse->rtable);
+
+   /*
+* Only the parent subquery of a flattened UNION ALL and an 
inherited
+* table can have c

Re: speeding up planning with partitions

2019-03-05 Thread Amit Langote
On 2019/03/04 19:38, Amit Langote wrote:
> 2. Defer inheritance expansion to add_other_rels_to_query().  Although the
> purpose of doing this is to perform partition pruning before adding the
> children, this patch doesn't change when the pruning occurs.  It deals
> with other issues that must be taken care of due to adding children during
> query_planner instead of during subquery_planner.  Especially,
> inheritance_planner now has to add the child target relations on its own.
> Also, delaying adding children also affects adding junk columns to the
> query's targetlist based on PlanRowMarks, because preprocess_targetlist
> can no longer finalize which junk columns to add for a "parent"
> PlanRowMark; that must be delayed until all child PlanRowMarks are added
> and their allMarkTypes propagated to the parent PlanRowMark.

I thought more on this and started wondering why we can't call
preprocess_targetlist() from query_planner() instead of from
grouping_planner()?  We don't have to treat parent row marks specially if
preprocess_targetlist() is called after adding other rels (and hence all
child row marks).  This will change the order in which expressions are
added to baserels targetlists and hence the order of expressions in their
Path's targetlist, because the expressions contained in targetlist
(including RETURNING) and other junk expressions will be added after
expressions referenced in WHERE clauses, whereas the order is reverse
today.  But if we do what we propose above, the order will be uniform for
all cases, that is, not one for regular table baserels and another for
inherited table baserels.

Thoughts?

Thanks,
Amit




Re: query logging of prepared statements

2019-03-05 Thread Arthur Zakirov

On 04.03.2019 21:31, Justin Pryzby wrote:

It wasn't intentional.  Find attached a v3 patch which handles that case
by removing the 2nd call to errdetail_execute(); since it's otherwise unused,
that function is removed entirely.


Thank you.


Thanks for reviewing.  I'm also interested in discussion about whether this
change is undesirable for someone else for some reason?  For me, the existing
output seems duplicative and "denormalized".  :)


I perfectly understand your use case. I agree, it is duplicated. But I 
think some people may want to see it at every EXECUTE, if they don't 
want to grep for the prepared statement body which was logged earlier.


I think it would be good to give the possibility to configure this behavior.
The first version of your patch relied on the log_error_verbosity GUC.
I'm not sure that this variable is suitable for configuring the visibility
of the prepared statement body in logs, because it controls more general
behavior. Maybe it would be better to introduce a new GUC variable, if
the community doesn't mind.
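
A new boolean GUC could be quite small.  A hypothetical sketch follows;
the setting name, description, and default are all invented, and the entry
format is the one used by the ConfigureNamesBool[] table in guc.c:

```c
/* guc.c: hypothetical setting controlling the duplicated log output */
static bool log_prepared_statement_body = true;

/* entry for the ConfigureNamesBool[] table */
{
    {"log_prepared_statement_body", PGC_SUSET, LOGGING_WHAT,
        gettext_noop("Logs the prepared statement text on each EXECUTE."),
        NULL
    },
    &log_prepared_statement_body,
    true,
    NULL, NULL, NULL
},
```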


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Re: [PATCH] pgbench tap tests fail if the path contains a perl special character

2019-03-05 Thread Raúl Marín Rodríguez
> Marking the patch as either Ready for Committer or Waiting on Author
> will help move things along.

Thanks for the heads-up, I've retested it again and it applies cleanly on
top of master and addresses the issue.

I've moved it to `Ready for Committer`.



-- 
Raúl Marín Rodríguez
carto.com



Re: New vacuum option to do only freezing

2019-03-05 Thread Kyotaro HORIGUCHI
Hello, I have some other comments.

At Mon, 4 Mar 2019 23:27:10 +, "Bossart, Nathan"  
wrote in <48410154-e6c5-4c07-8122-8d04e3bcd...@amazon.com>
> On 2/28/19, 12:13 AM, "Masahiko Sawada"  wrote:
> > Attached the updated version patch. I've incorporated all review
> > comments I got and have changed the number of tuples being reported as
> > 'removed tuples'. With this option, tuples completely being removed is
> > only tuples marked as unused during HOT-pruning, other dead tuples are
> > left. So we count those tuples during HOT-pruning and reports it as
> > removed tuples.
> 
> Thanks for the new patches.  Beyond the reloptions discussion, most of
> my feedback is wording suggestions.
> 
> +  VACUUM removes dead tuples and prunes HOT-updated
> +  tuples chain for live tuples on table. If the table has any dead tuple
> +  it removes them from both table and indexes for re-use. With this
> +  option VACUUM doesn't completely remove dead tuples
> +  and disables removing dead tuples from indexes.  This is suitable for
> +  avoiding transaction ID wraparound (see
> +  ) but not sufficient for 
> avoiding
> +  index bloat. This option is ignored if the table doesn't have index.
> +  This cannot be used in conjunction with FULL
> +  option.
> 
> There are a couple of small changes I would make.  How does something
> like this sound?
> 
> VACUUM removes dead tuples and prunes HOT-updated tuple chains for
> live tuples on the table.  If the table has any dead tuples, it
> removes them from both the table and its indexes and marks the
> corresponding line pointers as available for re-use.  With this
> option, VACUUM still removes dead tuples from the table, but it
> does not process any indexes, and the line pointers are marked as
> dead instead of available for re-use.  This is suitable for
> avoiding transaction ID wraparound (see Section 24.1.5) but not
> sufficient for avoiding index bloat.  This option is ignored if
> the table does not have any indexes.  This cannot be used in
> conjunction with the FULL option.
> 
> - * Returns the number of tuples deleted from the page and sets
> - * latestRemovedXid.
> + * Returns the number of tuples deleted from the page and set latestRemoveXid
> + * and increment nunused.
> 
> I would say something like: "Returns the number of tuples deleted from
> the page, sets latestRemovedXid, and updates nunused."
> 
> + /*
> +  * hasindex = true means two-pass strategy; false means one-pass. But we
> +  * always use the one-pass strategy when index vacuum is disabled.
> +  */
> 
> I think the added sentence should make it more clear that hasindex
> will still be true when DISABLE_INDEX_CLEANUP is used.  Maybe
> something like:
> 
> /*
>  * hasindex = true means two-pass strategy; false means one-pass
>  *
>  * If DISABLE_INDEX_CLEANUP is used, hasindex may still be true,
>  * but we'll always use the one-pass strategy.
>  */
> 
> tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, 
> false,
> - 
>&vacrelstats->latestRemovedXid);
> + 
>&vacrelstats->latestRemovedXid,
> + 
>&tups_pruned);
> 
> Why do we need a separate tups_pruned argument in heap_page_prune()?
> Could we add the result of heap_page_prune() to tups_pruned instead,
> then report the total number of removed tuples as tups_vacuumed +
> tups_pruned elsewhere?
> 
> +  * If there are no indexes or we skip index vacuum then we can 
> vacuum
> +  * the page right now instead of doing a second scan.
> 
> How about:
> 
> If there are no indexes or index cleanup is disabled, we can
> vacuum the page right now instead of doing a second scan.
> 
> + /*
> +  * Here, we have indexes but index vacuum is 
> disabled. We don't
> +  * vacuum dead tuples on heap but forget them 
> as we skip index
> +  * vacuum. The vacrelstats->dead_tuples could 
> have tuples which
> +  * became dead after checked at HOT-pruning 
> time but aren't marked
> +  * as dead yet. We don't process them because 
> it's a very rare
> +  * condition and the next vacuum will process 
> them.
> +  */
> 
> I would suggest a few small changes:
> 
> /*
>  * Here, we have indexes but index vacuum is disabled.  Instead of
>  * vacuuming the dead tuples on the heap, we just forget them.
>  *
>  * Note that vacrelstats->dead_tuples could include tuples which
> >  * became dead after HOT-pruning but are not marked as dead yet.

Re: Fix memleaks and error handling in jsonb_plpython

2019-03-05 Thread Nikita Glukhov

On 05.03.2019 6:45, Michael Paquier wrote:


On Fri, Mar 01, 2019 at 05:24:39AM +0300, Nikita Glukhov wrote:

Unfortunately, contrib/jsonb_plpython still contains a number of problems in
error handling that can lead to memory leaks:
  - not all Python function calls are checked for success
  - PG exceptions are not caught in all the places where Python references
need to be released
But it seems that these errors can happen only in OOM cases.

Attached is a patch with the fix. A back-patch to PG11 is needed.

That looks right to me.  Here are some comments.

One thing to be really careful of when using PG_TRY/PG_CATCH blocks is
that variables modified in the try block and then referenced in the
catch block need to be marked as volatile.  If you don't do that, the
value when reaching the catch part is indeterminate.

With your patch the result variable used in two places of
PLyObject_FromJsonbContainer() is not marked as volatile.  Similarly,
it seems that "items" in PLyMapping_ToJsonbValue() and "seq" in
"PLySequence_ToJsonbValue" should be volatile because they get changed
in the try loop, and referenced afterwards.


I know about these volatility issues, but maybe I incorrectly understand what
should be marked as volatile for pointer variables: the pointer itself and/or
the memory it references.  I thought that only the pointer needs to be marked,
and there is also a message [1] clearly describing what needs to be marked.


Previously, in PLyMapping_ToJsonbValue(), the pointed-to PyObject contents were
marked as volatile, not the pointer itself, which is not modified in PG_TRY:

-   /* We need it volatile, since we use it after longjmp */
-   volatile PyObject *items_v = NULL;

So I removed the volatile qualifier here.

The 'result' variable is likewise not modified in PG_TRY, so it is also left non-volatile.


I marked only the 'key' variable in PLyObject_FromJsonbContainer() as volatile,
because it really is modified in the loop inside PG_TRY(), and the
PLyObject_FromJsonbValue(&v2) call after its assignment can throw a PG
exception:
+  PyObject   *volatile key = NULL;
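
For illustration, here is a minimal self-contained sketch of the pattern under
discussion (the function name and error text are invented, not taken from the
patch): the volatile qualifier sits on the pointer, because the pointer is what
gets assigned inside PG_TRY() and then read in PG_CATCH() after the longjmp.

/* Python.h must be included before the PostgreSQL headers, as in plpython. */
#include <Python.h>

#include "postgres.h"

static PyObject *
build_list_guarded(Py_ssize_t nitems)
{
	/* volatile applies to the pointer itself, which changes inside PG_TRY */
	PyObject   *volatile list = NULL;

	PG_TRY();
	{
		list = PyList_New(nitems);
		if (list == NULL)
			elog(ERROR, "could not allocate Python list");

		/* ... further calls here may raise a PostgreSQL error ... */
	}
	PG_CATCH();
	{
		/* drop the reference before propagating the error */
		Py_XDECREF(list);
		PG_RE_THROW();
	}
	PG_END_TRY();

	return list;
}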


I also have an idea to introduce a global list of Python objects that need to
be dereferenced in PG_CATCH inside PLy_exec_function() in the case of an
exception.  Then typical code would look like this:

  PyObject *list = PLy_RegisterObject(PyList_New());

  if (!list)
return NULL;

  ... code that can throw PG exception, PG_TRY/PG_CATCH is not needed ...

  return PLy_UnregisterObject(list); /* returns list */


Another issue: in ltree_plpython we don't check the return state of
PyList_SetItem(), which we should complain about I think.


Yes, PyList_SetItem() and PyString_FromStringAndSize() should be checked,
but CPython's PyList_SetItem() really should not fail because list storage
is preallocated:

int
PyList_SetItem(PyObject *op, Py_ssize_t i, PyObject *newitem)
{
PyObject **p;
if (!PyList_Check(op)) {
Py_XDECREF(newitem);
PyErr_BadInternalCall();
return -1;
}
if (!valid_index(i, Py_SIZE(op))) {
Py_XDECREF(newitem);
PyErr_SetString(PyExc_IndexError,
"list assignment index out of range");
return -1;
}
p = ((PyListObject *)op) -> ob_item + i;
Py_XSETREF(*p, newitem);
return 0;
}


[1] https://www.postgresql.org/message-id/31436.1483415248%40sss.pgh.pa.us

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Tue, Mar 5, 2019 at 11:35 AM Tom Lane  wrote:
> True.  I've spent some time today running the ssl tests on various
> machines here, without any luck reproducing.

BTW, while I was looking for other failures on the buildfarm I noticed that
even for eelpout it's only happening on master and REL_11_STABLE:

https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=200&branch=&stage=sslCheck&filter=Submit

Disappointingly, that turned out to be just because 10 and earlier
didn't care what the error message said.

-- 
Thomas Munro
https://enterprisedb.com



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-05 Thread Heikki Linnakangas
I'm looking at the first patch in the series now. I'd suggest that you 
commit that very soon. It's useful on its own, and seems pretty much 
ready to be committed already. I don't think it will be much affected by 
whatever changes we make to the later patches, anymore.


I did some copy-editing of the code comments, see attached patch which 
applies on top of v14-0001-Refactor-nbtree-insertion-scankeys.patch. 
Mostly, to use more Plain English: use active voice instead of passive, 
split long sentences, avoid difficult words.


I also had a few comments and questions on some details. I added them in 
the same patch, marked with "HEIKKI:". Please take a look.


- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89a..eb4df2ebbe6 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -609,6 +609,9 @@ original search scankey is consulted as each index entry is sequentially
 scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
+HEIKKI: The above probably needs some updating, now that we have a
+separate BTScanInsert struct to represent an insertion scan key.
+
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
 all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b3fbba276dd..2a2d6576060 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -97,9 +97,12 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
  *		UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
  *		For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- *		don't actually insert.  If rel is a unique index, then every call
- *		here is a checkingunique call (i.e. every call does a duplicate
- *		check, though perhaps only a tentative check).
+ *		don't actually insert.
+
+HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
+weird to talk about it in the function comment. I didn't understand what
+the point of adding this sentence was, so I removed it.
+
  *
  *		The result value is only significant for UNIQUE_CHECK_PARTIAL:
  *		it must be true if the entry is known unique, else false.
@@ -285,9 +288,10 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
 		/*
-		 * Do the insertion.  Note that itup_key contains mutable state used
-		 * by _bt_check_unique to help _bt_findinsertloc avoid repeating its
-		 * binary search.  !checkingunique case must start own binary search.
+		 * Do the insertion.  Note that itup_key contains state filled in by
+		 * _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start its own binary
+		 * search.
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
 	   itup, stack, heapRel);
@@ -311,10 +315,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
- * reuse most of the work of our initial binary search to find conflicting
- * tuples.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -326,6 +326,10 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in itup_key that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
 _bt_check_unique(Relation rel, BTScanInsert itup_key,
@@ -352,8 +356,8 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	/*
-	 * Save binary search bounds.  Note that this is also used within
-	 * _bt_findinsertloc() later.
+	 * Save binary search bounds.  We use them in the fastpath below, but
+	 * also in the _bt_findinsertloc() call later.
 	 */
 	itup_key->savebinsrch = true;
 	offset = _bt_binsrch(rel, itup_key, buf);
@@ -375,16 +379,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 		if (offset <= maxoff)
 		{
 			/*
-			 * Fastpath: _bt_binsrch() search bounds can be used to limit our
-			 * consideration to items that are definitely duplicates in most
-			 * cases (though not when original page is empty, or when initial
-			 * offset is past the end of the original page, which may indicate
-			 * that we'll have to examine a second or subsequent page).
+			 *

Re: Re: Patch for SortSupport implementation on inet/cdir

2019-03-05 Thread David Steele

On 2/16/19 4:13 AM, Andres Freund wrote:


On 2019-02-09 20:12:53 +1300, Edmund Horner wrote:

I had a look at this.  Your V2 patch applies cleanly, and the code was
straightforward and well commented.  I appreciate the big comment at the
top of network_abbrev_convert explaining how you encode the data.


I've marked the CF entry as waiting-on-author, due to this review.

Brandur, unfortunately this patch has only been submitted to the last
commitfest for v12. Our policy is that all nontrivial patches should be
submitted to at least one earlier commitfest. Therefore I unfortunately
think that v12 is not a realistic target, sorry :(


I agree, and since there has been no response from the author I have 
pushed the target version to PG13.


Regards,
--
-David
da...@pgmasters.net



Re: Prevent extension creation in temporary schemas

2019-03-05 Thread Chris Travers
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I ran make check-world and everything passed.

I tried installing a test extension into a temp schema.  I found this was 
remarkably difficult to do because pg_temp did not work (I had to create a 
temporary table and then locate the actual schema it was created in).  While 
that might also be a bug, it is not in the scope of this patch, so I'm mostly 
noting it as future work.

After creating the extension I did the following:
\dx in the current session shows the extension.
\dx in a stock psql in a separate session shows the extension.
\dx with a patched psql in a separate session does not show the extension.

In terms of the scope of this patch, I think this correctly and fully solves 
the problem at hand.

The new status of this patch is: Ready for Committer


Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread Tomas Vondra
On 3/5/19 1:14 AM, Peter Geoghegan wrote:
> On Mon, Feb 25, 2019 at 8:48 AM Robert Haas  wrote:
>> +1 for raising the default substantially.  In my experience, and it
>> seems others are in a similar place, nobody ever gets into trouble
>> because the default is too high, but sometimes people get in trouble
>> because the default is too low.
> 
> Does anyone want to make an argument against the idea of raising the
> default? They should speak up now.
> 

I don't know.

On the one hand I don't feel very strongly about this change, and I have
no intention to block it (because in most cases I do actually increase
the value anyway). I wonder if those with small systems will be happy
about it, though.

But on the other hand it feels a bit weird that we increase this one
value and leave all the other (also very conservative) defaults alone.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-05 Thread Tomas Vondra
On 3/5/19 4:12 AM, Michael Paquier wrote:
> On Mon, Mar 04, 2019 at 03:08:09PM +0100, Tomas Vondra wrote:
>> I still don't understand what issue you see in how basebackup verifies
>> checksums. Can you point me to the explanation you've sent after 11 was
>> released?
> 
> The history is mostly on this thread:
> https://www.postgresql.org/message-id/20181020044248.gd2...@paquier.xyz
> 

Thanks, will look.

Based on quickly skimming that thread the main issue seems to be
deciding which files in the data directory are expected to have
checksums. Which is a valid issue, of course, but I was expecting
something about partial read/writes etc.

>> So you have a workload/configuration that actually results in data
>> corruption yet we fail to detect that? Or we generate false positives?
>> Or what do you mean by "100% safe" here?
> 
> What's proposed on this thread could generate false positives.  Checks
> which have deterministic properties and clean failure handling are
> reliable when it comes to reports.

My understanding is that:

(a) The checksum verification should not generate false positives (same
as for basebackup).

(b) The partial reads do emit warnings, which might be considered false
positives I guess. Which is why I'm arguing for changing it to do the
same thing basebackup does, i.e. ignore this.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GiST VACUUM

2019-03-05 Thread Heikki Linnakangas

On 05/03/2019 02:26, Andrey Borodin wrote:
I also tried your amcheck tool with this. It did not report any 
errors.


Attached is also latest version of the patch itself. It is the
same as your latest patch v19, except for some tiny comment
kibitzing. I'll mark this as Ready for Committer in the commitfest
app, and will try to commit it in the next couple of days.


That's cool! I'll work on the 2nd step of this patchset to make the
blockset data structure prettier and less hacky.


Committed the first patch. Thanks for the patch!

I did some last-minute copy-editing on the comments, and fixed one 
little thing that we missed earlier: the check for "invalid tuples" that 
might be left over in pg_upgraded pre-9.1 indexes was lost at some 
point. I added that check back. (It would be nice if GiST indexes had a 
metadata page with a version number, so we could avoid that work in the 
99% of cases where that check is not needed, but that's a different story.)


I'll change the status of this patch to "Waiting on Author", to reflect 
the state of the 2nd patch, since you're working on the radix tree 
blockset stuff.


- Heikki



Re: Patch to document base64 encoding

2019-03-05 Thread Karl O. Pinc
Hi Fabien,

On Tue, 5 Mar 2019 07:09:01 +0100 (CET)
Fabien COELHO  wrote:

> > Doc patch, against master.  Documents encode() and decode() base64 
> > format.  
> 
> It is already documented. Enhance documentation, though.

Right.  I was thinking that there are various implementations
of the base64 data format and so it needed more than
just to be named.

> > Attached: doc_base64_v1.patch
> >
> > References RFC2045 section 6.8 to define base64.  
> 
> Did you consider referencing RFC 4648 instead?

Not really.  What drew me to document this was the line
breaks every 76 characters.  So I pretty much went
straight to the MIME RFC, which says there should
be breaks at 76 characters.

I can see advantages and disadvantages either way.
Each RFC contains more or less extraneous information
that is only partly related to base64.
Which RFC do you think should be referenced?

Attached: doc_base64_v2.patch

This new version adds a phrase clarifying that
decode errors are raised when trailing padding
is wrong.  Seemed like I may as well be explicit.

(I am not entirely pleased with the double dash
but can't come up with anything better.  And
can't make an emdash entity work either.)

Thanks for taking a look.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..95b8f5cd5a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+ base64
+
 decode(string text,
 format text)

@@ -1769,13 +1772,16 @@
 
  encode
 
+
+ base64
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
+formats are: base64, hex, escape.
 escape converts zero bytes and high-bit-set bytes to
 octal sequences (\nnn) and
 doubles backslashes.
@@ -2365,6 +2371,21 @@
 format treats a NULL as a zero-element array.

 
+   
+ base64
+   
+
+   
+ The base64 encoding of the encode
+ and decode functions is that of RFC2045 section 6.8.
+ As per the RFC, encoded lines are broken at 76 characters.  However
+ instead of the MIME CRLF end-of-line marker, only a newline is used for
+ end-of-line.  The carriage-return, newline, space, and tab characters are
+ ignored by decode.  Otherwise, an error is raised
+ when decode is supplied invalid base64 data --
+ including when trailing padding is incorrect.
+   
+

See also the aggregate function string_agg in
.
@@ -3577,13 +3598,16 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 
  encode
 
+
+ base64
+
encode(data bytea,
format text)
   
   text
   
Encode binary data into a textual representation.  Supported
-   formats are: base64, hex, escape.
+   formats are: base64, hex, escape.
escape converts zero bytes and high-bit-set bytes to
octal sequences (\nnn) and
doubles backslashes.


Re: RE: libpq debug log

2019-03-05 Thread David Steele

On 3/5/19 11:48 AM, Iwata, Aya wrote:


So is it alright to add this information to the new/proposed PQtrace() default 
output?


I agree with Andres [1] that it's not very clear where this patch is  
going and we should push the target to PG13.


Regards,
--
-David
da...@pgmasters.net

[1]  
https://www.postgresql.org/message-id/raw/20190214203752.t4hl574k6jlu4t25%40alap3.anarazel.de




Re: Re: proposal: new polymorphic types - commontype and commontypearray

2019-03-05 Thread David Steele

On 2/4/19 4:21 AM, Michael Paquier wrote:

On Wed, Jan 30, 2019 at 05:08:03PM +0100, Pavel Stehule wrote:

maybe "supertype". It is one char shorter .. somewhere is term
"supperclass, ..."

In Czech language this term is short, "nadtyp", but probably it is not
acceptable :)


Moved to next CF.


This thread has been very quiet for a month.  I agree with Andres [1] 
that we should push this to PG13.


--
-David
da...@pgmasters.net

[1] 
https://www.postgresql.org/message-id/raw/20190214203752.t4hl574k6jlu4t25%40alap3.anarazel.de 





Re: Patch to document base64 encoding

2019-03-05 Thread Karl O. Pinc
On Tue, 5 Mar 2019 07:26:17 -0600
"Karl O. Pinc"  wrote:

> (I am not entirely pleased with the double dash
> but can't come up with anything better.  And
> can't make an emdash entity work either.)

Attached: doc_base64_v3.patch

There is an mdash entity.  This patch uses that.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..e5ae322faa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+ base64
+
 decode(string text,
 format text)

@@ -1769,13 +1772,16 @@
 
  encode
 
+
+ base64
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
+formats are: base64, hex, escape.
 escape converts zero bytes and high-bit-set bytes to
 octal sequences (\nnn) and
 doubles backslashes.
@@ -2365,6 +2371,21 @@
 format treats a NULL as a zero-element array.

 
+   
+ base64
+   
+
+   
+ The base64 encoding of the encode
+ and decode functions is that of RFC2045 section 6.8.
+ As per the RFC, encoded lines are broken at 76 characters.  However
+ instead of the MIME CRLF end-of-line marker, only a newline is used for
+ end-of-line.  The carriage-return, newline, space, and tab characters are
+ ignored by decode.  Otherwise, an error is raised
+ when decode is supplied invalid base64 data —
+ including when trailing padding is incorrect.
+   
+

See also the aggregate function string_agg in
.
@@ -3577,13 +3598,16 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 
  encode
 
+
+ base64
+
encode(data bytea,
format text)
   
   text
   
Encode binary data into a textual representation.  Supported
-   formats are: base64, hex, escape.
+   formats are: base64, hex, escape.
escape converts zero bytes and high-bit-set bytes to
octal sequences (\nnn) and
doubles backslashes.


Re: BUG #15668: Server crash in transformPartitionRangeBounds

2019-03-05 Thread Amit Langote
Hi,

(cc'ing -hackers and Peter E)

On Tue, Mar 5, 2019 at 8:02 PM PG Bug reporting form
 wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:  15668
> Logged by:  Alexander Lakhin
> Email address:  exclus...@gmail.com
> PostgreSQL version: Unsupported/Unknown
> Operating system:   Ubuntu 18.04
> Description:
>
> The following query:
> CREATE TABLE range_parted (a int) PARTITION BY RANGE (a);
> CREATE TABLE rp_part PARTITION OF range_parted FOR VALUES FROM
> (unknown.unknown) TO (1);
>
> crashes server (on the master branch) with the stack trace:
> Core was generated by `postgres: law regression [local] CREATE TABLE
> '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x560ab19ea0bc in transformPartitionRangeBounds
> (pstate=pstate@entry=0x560ab3290da8, blist=,
> parent=parent@entry=0x7f7846a6bea8) at parse_utilcmd.c:3754
> 3754if (strcmp("minvalue", cname) == 0)

Thanks for the report.  This seems to be a bug in the following commit in
HEAD, of which I was one of the authors:

commit 7c079d7417a8f2d4bf5144732e2f85117db9214f
Author: Peter Eisentraut 
Date:   Fri Jan 25 11:27:59 2019 +0100

Allow generalized expression syntax for partition bounds

That seems to be caused by some over-optimistic coding in
transformPartitionRangeBounds.  The following will crash too.

CREATE TABLE rp_part PARTITION OF range_parted FOR VALUES FROM
(a.a.a.a.a.a.a.a.a.a.a.a) TO (1);

If I try the list partitioning syntax, it doesn't crash but gives the
following error:

create table lparted1 partition of lparted for values in (a.a.a.a.a.a);
ERROR:  improper qualified name (too many dotted names): a.a.a.a.a.a
LINE 1: ...able lparted1 partition of lparted for values in (a.a.a.a.a
 ^
Maybe we should error out as follows in
transformPartitionRangeBounds(), although that means we'll get a
different error message than when using the list partitioning syntax:

@@ -3749,6 +3749,12 @@ transformPartitionRangeBounds(ParseState
*pstate, List *blist,
if (list_length(cref->fields) == 1 &&
IsA(linitial(cref->fields), String))
cname = strVal(linitial(cref->fields));
+   else
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+errmsg("invalid expression for range bound"),
+parser_errposition(pstate,
+   exprLocation((Node *) expr))));

Assert(cname != NULL);
if (strcmp("minvalue", cname) == 0)

Thanks,
Amit



Re: Re: proposal: new polymorphic types - commontype and commontypearray

2019-03-05 Thread Pavel Stehule
On Tue, 5 Mar 2019 at 14:38, David Steele  wrote:

> On 2/4/19 4:21 AM, Michael Paquier wrote:
> > On Wed, Jan 30, 2019 at 05:08:03PM +0100, Pavel Stehule wrote:
> >> maybe "supertype". It is one char shorter .. somewhere is term
> >> "supperclass, ..."
> >>
> >> In Czech language this term is short, "nadtyp", but probably it is not
> >> acceptable :)
> >
> > Moved to next CF.
>
> This thread has been very quiet for a month.  I agree with Andres [1]
> that we should push this to PG13.
>

ok

Pavel


> --
> -David
> da...@pgmasters.net
>
> [1]
>
> https://www.postgresql.org/message-id/raw/20190214203752.t4hl574k6jlu4t25%40alap3.anarazel.de
>
>


Re: Rare SSL failures on eelpout

2019-03-05 Thread Tom Lane
Thomas Munro  writes:
> BTW, I went looking for other failures on the buildfarm I noticed that
> even for eelpout it's only happening on master and REL_11_STABLE:

Yeah, I'd noticed that.

> Disappointingly, that turned out to be just because 10 and earlier
> didn't care what the error message said.

That is, you can reproduce the failure on old branches?  That lets
out a half-theory I'd had, which was that Andres' changes to make
the backend always run its socket in nonblock mode had had something
to do with it.  (Those changes do represent a plausible reason why
SSL_shutdown might be returning WANT_READ/WANT_WRITE; but I'm not
in a hurry to add such code without evidence that it actually
happens and something useful would change if we retry.)

regards, tom lane



Re: proposal: new polymorphic types - commontype and commontypearray

2019-03-05 Thread Tom Lane
David Steele  writes:
> This thread has been very quiet for a month.  I agree with Andres [1] 
> that we should push this to PG13.

I think the main thing it's blocked on is disagreement on what the
type name should be, which is kind of a silly thing to get blocked on,
but nonetheless it's important ...

regards, tom lane



Re: NOT IN subquery optimization

2019-03-05 Thread Tom Lane
David Steele  writes:
> I'm not sure if I have an issue with competing patches on the same 
> thread.  I've seen that before and it can lead to a good outcome.  It can,
> as you say, also lead to confusion.

It's a bit of a shame that the cfbot will only be testing one of them
at a time if we leave it like this.  I kind of lean towards the
two-thread, two-CF-entry approach because of that.  The amount of
confusion is a constant.

regards, tom lane



Re: [Issue] Can't recompile cube extension as PGXS, utils/float.h is not installed

2019-03-05 Thread Tom Lane
Siarhei Siniak  writes:
> 3. But utils/float.h required by cube.c and cubeparse.y is not installed.

AFAICT, that file only exists in HEAD, not in any released branch, and
it is installed during "make install" from HEAD.  Please be sure you
are using installed files that match whatever branch you're trying
to build from.

regards, tom lane



Re: Fsync-before-close thought experiment

2019-03-05 Thread Robert Haas
On Sun, Mar 3, 2019 at 6:31 PM Thomas Munro  wrote:
> A more obvious approach that probably moves us closer to the way
> kernel developers expect us to write programs is to call fsync()
> before close() (due to vfd pressure) if you've written.

Interesting

> The obvious
> problem with that is that you could finish up doing loads more
> fsyncing than we're doing today if you're regularly dirtying more than
> max_files_per_process in the same backend.

Check.

> I wonder if we could make
> it more acceptable like so:
>
> * when performing writes, record the checkpointer's cycle counter in the File
>
> * when closing due to vfd pressure, only call fsync() if the cycle
> hasn't advanced (that is, you get to skip the fsync() if you haven't
> written in this sync cycle, because the checkpointer must have taken
> care of your writes from the previous cycle)

Hmm, OK.

> * if you find you're doing this too much (by default, dirtying more
> than 1k relation segments per checkpoint cycle), maybe log a warning
> that the user might want to consider increasing max_files_per_process
> (and related OS limits)
>
> A possible improvement, stolen from the fd-passing patch, is to
> introduce a "sync cycle", separate from checkpoint cycle, so that the
> checkpointer can process its table of pending operations more than
> once per checkpoint if that would help minimise fsyncing in foreground
> processes.  I haven't thought much about how exactly that would work.

Yeah, that seems worth considering.  I suppose that a backend could
keep track of how many times it's recorded the current sync cycle in a
File that is still open -- this seems like it should be pretty simple
and cheap, provided we can find the right places to put the counter
adjustments.  If that number gets too big, like say greater than 80%
of the number of fds, it sends a ping to the checkpointer.  I'm not
sure if that would then immediately trigger a full sync cycle or if
there is something more granular we could do.

> There is a less obvious problem with this scheme though:
>
> 1.  Suppose the operating system has a single error flag for a file no
> matter how many fds have it open, and it is cleared by the first
> syscall to return EIO.  There is a broken schedule like this (B =
> regular backend, C = checkpointer):
>
> B: fsync() -> EIO # clears error flag
> C: fsync() -> success, log checkpoint
> B: PANIC!
>
> 2.  On an operating system that has an error counter + seen flag
> (Linux 4.14+), in general you receive the error in all fds, which is a
> very nice property, but we'd still have a broken schedule involving
> the seen flag:
>
> B: fsync() -> EIO # clears seen flag
> C: open() # opened after error
> C: fsync() -> success, log checkpoint
> B: PANIC!
>
> Here's one kind of interlocking that might work:  Hash pathnames and
> map to an array of lwlocks + error flags.  Any process trying to sync
> a file must hold the lock and check for a pre-existing error flag.
> Now a checkpoint cannot succeed if any backend has recently decided to
> panic.  You could skip that if data_sync_retry = on.

That might be fine, but I think it might be possible to create
something more light-weight.  Suppose we just decide that foreground
processes always win and the checkpointer has to wait before logging
the checkpoint.  To that end, a foreground process advertises in
shared memory whether or not it is currently performing an fsync.  The
checkpointer must observe each process until it sees that process in
the not-fsyncing state at least once.

If a process starts fsync-ing after being observed not-fsyncing, it
just means that the backend started doing an fsync() after the
checkpointer had completed the fsyncs for that checkpoint.  And in
that case the checkpointer would have observed the EIO for any writes
prior to the checkpoint, so it's OK to write that checkpoint; it's
only the next one that has an issue, and the fact that we're now
advertising that we are fsync()-ing again will prevent that one from
completing before we emit any necessary PANIC.
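
To make the shape of that interlock a bit more concrete, here is a rough
sketch.  None of these structures or functions exist today; the names and the
shared-memory layout are invented purely to illustrate the idea above.

#include "postgres.h"

#include <unistd.h>

#include "miscadmin.h"
#include "port/atomics.h"

/* hypothetical per-backend slot in shared memory, initialized elsewhere
 * with pg_atomic_init_u32() */
typedef struct FsyncAdvertisement
{
	pg_atomic_uint32 fsync_in_progress; /* 0 = idle, 1 = fsync running */
} FsyncAdvertisement;

/* Foreground side: advertise before syncing, clear afterwards. */
static void
backend_fsync(FsyncAdvertisement *my_slot, int fd)
{
	pg_atomic_write_u32(&my_slot->fsync_in_progress, 1);
	pg_memory_barrier();

	if (fsync(fd) != 0)
		ereport(PANIC,
				(errcode_for_file_access(),
				 errmsg("could not fsync file: %m")));

	pg_atomic_write_u32(&my_slot->fsync_in_progress, 0);
}

/*
 * Checkpointer side: before logging the checkpoint record, observe every
 * backend in the not-fsyncing state at least once.
 */
static void
wait_for_foreground_fsyncs(FsyncAdvertisement *slots, int nslots)
{
	for (int i = 0; i < nslots; i++)
	{
		while (pg_atomic_read_u32(&slots[i].fsync_in_progress) != 0)
			pg_usleep(1000L);	/* 1ms; real code would wait on a latch */
	}
}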

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Index Skip Scan

2019-03-05 Thread Dmitry Dolgov
> On Thu, Feb 28, 2019 at 10:45 PM Jeff Janes  wrote:
>
> This version of the patch can return the wrong answer.

Yes, indeed. In fact it answers my previous question related to the backward
cursor scan, where, while going back, we didn't skip enough. Within the current
approach it can be fixed by proper skipping for the backward scan, something
like in the attached patch.

Although there are still some rough edges, e.g. going forth, back, and forth
again leads to a situation where `_bt_first` is not applied anymore and the
first element is wrongly skipped. I'll try to fix that in the next version of
the patch.

> If we accept this patch, I hope it would be expanded in the future to give
> similar performance as the above query does even when the query is written in
> its more natural way of:

Yeah, I hope the current approach with a new index am routine can be extended
for that.

> Without the patch, instead of getting a wrong answer, I get an error:

Right, as far as I can see without a skip scan and SCROLL, a unique + index
scan is used, where amcanbackward is false by default. So looks like it's not
really patch related.


v10-0001-Index-skip-scan.patch
Description: Binary data


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-05 Thread Robert Haas
On Mon, Mar 4, 2019 at 1:01 AM Masahiko Sawada  wrote:
> I think that there is no need to use the same key for both the spill
> files and WAL, because only one process encrypts/decrypts spill files. We
> can use something like a temporary key for that use case, which is used
> by only one process and lives for the process lifetime (or transaction
> lifetime). The same is true for other temporary files such as
> tuplesort and tuplestore, although maybe we need tricks for shared
> tuplestore.

Agreed.  For a shared tuplestore you need a key that is shared between
the processes involved, but it doesn't need to be the same as any
other key.  For anything that is accessed by only a single process,
that process can just generate any old key and, as long as it's
secure, it's fine.

For the WAL, you could potentially create a new WAL record type that
is basically an encrypted wrapper around another WAL record.  So if
table X is encrypted with key K1, then all of the WAL records for
table X are wrapped inside of an encrypted-record WAL record that is
encrypted with key K1.  That's useful for people who want fine-grained
encryption only of certain data.

But for people who want to just encrypt everything, you need to
encrypt the entire WAL stream, all SLRU data, etc. and that pretty
much all has to be one key (or sub-keys derived from that one key
somehow).

> > Or what do you do
> > about SLRUs or other global structures?  If you just exclude that
> > stuff from the scope of encryption, then you aren't helping the people
> > who want to Just Encrypt Everything.
>
> Why do people want to just encrypt everything? For satisfying some
> security compliance?

Yeah, I think so.  Perhaps an encrypted filesystem is a better way to
go, but some people want something that is built into the database
server.  The motivation seems to be mostly that they have a compliance
requirement -- either the database itself encrypts everything, or they
cannot use the software.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 7:53 AM Tomas Vondra
 wrote:
> But on the other hand it feels a bit weird that we increase this one
> value and leave all the other (also very conservative) defaults alone.

Are you talking about vacuum-related defaults or defaults in general?
In 2014, we increased the defaults for work_mem and
maintenance_work_mem by 4x and the default for effective_cache_size by
32x; in 2012, we increased the default for shared_buffers by 4x.
It's possible some of those parameters should be further increased at
some point, but deciding not to increase any of them until we can
increase all of them is tantamount to giving up on changing anything
at all.  I think it's OK to be conservative by default, but we should
increase parameters where we know that the default is likely to be too
conservative for 99% of users.  My only question about this change is
whether to go for a lesser multiple (e.g. 4x) rather than the proposed
10x.  But I think even if 10x turns out to be too much for a few more
people than we'd like, we're still going to be better off increasing
it and having some people have to turn it back down again than leaving
it the way it is and having users regularly suffer vacuum-starvation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



any plan to support shared servers like Oracle in PG?

2019-03-05 Thread Andy Fan
Currently there is one process per connection, which is not very good
for short-lived connections.  Oracle supports a shared server mode, in which
one server process can serve more than one user at the same time.

See
https://docs.oracle.com/cd/B28359_01/server.111/b28310/manproc001.htm#ADMIN11166


Do we have any plans for something like this?


Re: Delay locking partitions during query execution

2019-03-05 Thread Tomas Vondra



On 3/5/19 6:55 AM, David Rowley wrote:
> On Sat, 2 Feb 2019 at 02:52, Robert Haas  wrote:
>> I think the key question here is whether or not you can cope with
>> someone having done arbitrary AEL-requiring modifications to the
>> delaylocked partitions.  If you can, the fact that the plan might
>> sometimes be out-of-date is an inevitable consequence of doing this at
>> all, but I think we can argue that it's an acceptable consequence as
>> long as the resulting behavior is not too bletcherous.  If you can't,
>> then this patch is dead.
> 
> I spent some time looking at this patch again and thinking about the
> locking. I ended up looking through each alter table subcommand by way
> of AlterTableGetLockLevel() and going through scenarios with each of
> them in my head to try to understand:
> 
> a) If the subcommand can even be applied to a leaf partition; and
> b) Can the subcommand cause a cached plan to become invalid?
> 
> I did all the easy ones first then started on the harder ones. I came
> to AT_DropConstraint and imagined the following scenario:
> 
> -- ** session 1
> create table listp (a int) partition by list(a);
> create table listp1 partition of listp for values in(1);
> create table listp2 partition of listp for values in(2);
> 
> create index listp1_a_idx on listp1 (a);
> create index listp2_a_idx on listp2 (a);
> 
> set enable_seqscan = off;
> set plan_cache_mode = force_generic_plan;
> prepare q1 (int) as select * from listp where a = $1;
> execute q1 (1);
> begin;
> execute q1 (1);
> 
> 
> -- ** session 2
> drop index listp2_a_idx;
> 
> -- ** session 1
> execute q1 (2);
> ERROR:  could not open relation with OID 16401
> 
> The only way I can think to fix this is to just never lock partitions
> at all, and if a lock is to be obtained on a partition, it must be
> instead obtained on the top-level partitioned table.  That's a rather
> large change that could have large consequences, and I'm not even sure
> it's possible since we'd need to find the top-level parent before
> obtaining the lock, by then the hierarchy might have changed and we'd
> need to recheck, which seems like quite a lot of effort just to obtain
> a lock... Apart from that, it's not this patch, so looks like I'll
> need to withdraw this one :-(
> 

So you're saying we could

1) lookup the parent and lock it
2) repeat the lookup to verify it did not change

I think that could still be a win, assuming that most hierarchies will
be rather shallow (I'd say 2-3 levels will cover like 95% of cases, and
4 levels would be 100% in practice). And the second lookup should be
fairly cheap thanks to syscache and the fact that the hierarchies do not
change very often.
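
For what it's worth, a minimal sketch of that lookup/lock/recheck loop could
look like the following.  get_topmost_parent() is a hypothetical helper (it
would walk pg_inherits up to the root); LockRelationOid()/UnlockRelationOid()
are the existing lmgr entry points.

#include "postgres.h"

#include "storage/lmgr.h"

/* hypothetical: returns the root of the partition tree, or relid itself */
extern Oid	get_topmost_parent(Oid relid);

static Oid
lock_topmost_parent(Oid relid, LOCKMODE lockmode)
{
	for (;;)
	{
		Oid			parent = get_topmost_parent(relid);

		LockRelationOid(parent, lockmode);

		/* re-check: did the hierarchy change while we waited for the lock? */
		if (get_topmost_parent(relid) == parent)
			return parent;

		/* it changed; release the lock and retry from the top */
		UnlockRelationOid(parent, lockmode);
	}
}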

I can't judge how invasive this patch would be, but I agree it's more
complex than the originally proposed patch.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: speeding up planning with partitions

2019-03-05 Thread Jesper Pedersen

On 3/5/19 5:24 AM, Amit Langote wrote:

Attached an updated version.  This incorporates fixes for both Jesper's
and Imai-san's review.  I haven't been able to pin down the bug (or
whatever) that makes throughput go down as the partition count increases,
when tested with a --enable-cassert build.



Thanks !

I'm seeing the throughput going down as well, but are you sure it isn't 
just the extra calls to MemoryContextCheck you are seeing?  A flamegraph 
diff highlights that area (sent offline).


A non cassert build shows the same profile for 64 and 1024 partitions.

Best regards,
 Jesper



Re: Question about commit 11cf92f6e2e13c0a6e3f98be3e629e6bd90b74d5

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:00 AM Etsuro Fujita
 wrote:
> apply_projection_to_path() not only jams the given tlist into the
> existing path but updates its tlist eval costs appropriately except for
> the cases of Gather and GatherMerge:

I had forgotten that detail, but I don't think it changes the basic
picture.  Once you've added a bunch of Paths to a RelOptInfo, it's too
late to change their *relative* cost, because add_path() puts the list
in a certain order, and adjusting the path costs won't change that
ordering.  You've got to have the costs already correct at the time
add_path() is first called for any given Path.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] CLUSTER command progress monitor

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:56 AM Tatsuro Yamada
 wrote:
> >> === Discussion points ===
> >>
> >>- Progress counter for "3. sorting tuples" phase
> >>   - Should we add pgstat_progress_update_param() in tuplesort.c like a
> >> "trace_sort"?
> >> Thanks to Peter Geoghegan for the useful advice!
> >
> > How would we avoid an abstraction violation?
>
> Hmm... What do you mean by an abstraction violation?
> If it is difficult to solve, I'd rather not add the progress counter for the
> tuple-sorting phase.

What I mean is... I think it would be useful to have this counter, but
I'm not sure how the tuplesort code would know to update the counter
in this case and not in other cases.  The tuplesort code is used for
lots of things; we can't update a counter for CLUSTER if the tuplesort
is being used for CREATE INDEX or a Sort node in a query or whatever.
So my question is how we would indicate to the tuplesort that it needs
to do the counter update, and whether that would end up making for
ugly code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 3:33 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > Disappointingly, that turned out to be just because 10 and earlier
> > didn't care what the error message said.
>
> That is, you can reproduce the failure on old branches?  That lets
> out a half-theory I'd had, which was that Andres' changes to make
> the backend always run its socket in nonblock mode had had something
> to do with it.  (Those changes do represent a plausible reason why
> SSL_shutdown might be returning WANT_READ/WANT_WRITE; but I'm not
> in a hurry to add such code without evidence that it actually
> happens and something useful would change if we retry.)

Yes, on REL_10_STABLE:

$ for i in `seq 1 1000 ` ; do
psql "host=localhost port=56024 dbname=certdb user=postgres
sslcert=ssl/client-revoked.crt sslkey=ssl/client-revoked.key"
  done
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
...
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
could not send startup packet: Connection reset by peer
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked

Zooming in with strace:

sendto(3, 
"\27\3\3\2\356\r\214\352@\21\320\202\236}\376\367\262\227\177\255\212\204`q\254\108\326\201+c)"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT|POLLERR|POLLHUP}])
sendto(3, 
"\27\3\3\0cW_\210\337Q\227\360\216k\221\346\372pw\27\325P\203\357\245km\304Rx\355\200"...,
104, MSG_NOSIGNAL, NULL, 0) = -1 ECONNRESET (Connection reset by peer)

You can see that poll() already knew the other end had closed the
socket.  Since this is clearly timing... let's see, yeah, I can make
it fail every time by adding sleep(1) before the comment "Send the
startup packet.".  I assume that'll work on any Linux machine?

To set this test up, I ran a server with the following config:

ssl=on
ssl_ca_file='root+client_ca.crt'
ssl_cert_file='server-cn-only.crt'
ssl_key_file='server-cn-only.key'
ssl_crl_file='root+client.crl'

I copied those files out of src/test/ssl/ssl/.  Then I ran the psql
command shown earlier.  I think I had to chmod 600 the keys.

-- 
Thomas Munro
https://enterprisedb.com



Re: Refactoring the checkpointer's fsync request queue

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 5:07 AM Shawn Debnath  wrote:
> Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test.
> Interesting! It's reproducible, so I should be able to figure out what's
> going on. The only thing we do in ForwardSyncRequest() is split up the 8
> bits into 2x4 bits and copy the FileTagData structure to the
> checkpointer queue. Will report back what I find.

More review, all superficial stuff:

+typedef struct
+{
+   RelFileNode rnode;
+   ForkNumber  forknum;
+   SegmentNumber   segno;
+} FileTagData;
+
+typedef FileTagData *FileTag;

Even though I know I said we should take FileTag by pointer, and even
though there is an older tradition in the tree of having a struct
named "FooData" and a corresponding pointer typedef named "Foo", as
far as I know most people are not following the convention for new
code and I for one don't like it.  One problem is that there isn't a
way to make a pointer-to-const type given such a pointer typedef,
so you finish up throwing away const from your programs.  I like const
as documentation and a tiny bit of extra compiler checking.  What do
you think about "FileTag" for the struct and eg "const FileTag *tag"
when receiving one as a function argument?

-/* internals: move me elsewhere -- ay 7/94 */

Aha, about time too!

+#include "fmgr.h"
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"

Why do we need to include fmgr.h in md.h?

+/* md storage manager funcationality */

Typo.

+/* md sync callback forward declarations */

These aren't "forward" declarations, they're plain old declarations.

+extern char* mdfilepath(FileTag ftag);

Doesn't really matter too much because all of this will get
pgindent-ed at some point, but FYI we write "char *md", not "char*
md".

 #include "storage/smgr.h"
+#include "storage/md.h"
 #include "utils/hsearch.h"

Bad sorting.

+   FileTagData tag;
+   tag.rnode = reln->smgr_rnode.node;
+   tag.forknum = forknum;
+   tag.segno = seg->mdfd_segno;

I wonder if it would be better practice to zero-initialise that
sucker, so that if more members are added we don't leave them
uninitialised.  I like the syntax "FileTagData tag = {{0}}".
(Unfortunately extra nesting required here because first member is a
struct, and C99 doesn't allow us to use empty {} like C++, even though
some versions of GCC accept it.  Rats.)

-- 
Thomas Munro
https://enterprisedb.com



Re: jsonpath

2019-03-05 Thread Robert Haas
On Mon, Mar 4, 2019 at 6:27 PM Tomas Vondra
 wrote:
> 11) Wording of some of the error messages in the execute methods seems a
> bit odd. For example executeNumericItemMethod may complain that it
>
> ... is applied to not a numeric value
>
> but perhaps a more natural wording would be
>
> ... is applied to a non-numeric value
>
> And similarly for the other execute methods. But I'm not a native
> speaker, so perhaps the original wording is just fine.

As a native speaker I can confirm that the first wording is definitely
not OK.  The second one is tolerable, but I wonder if there is
something better, like "can only be applied to a numeric value" or
maybe there's a way to rephrase it so that we complain about the
non-numeric value itself rather than the application, e.g. ERROR:
json_frobnitz can only frob numeric values or ERROR: argument to
json_frobnitz must be numeric.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Refactoring the checkpointer's fsync request queue

2019-03-05 Thread Tom Lane
Thomas Munro  writes:
> +#include "fmgr.h"
> +#include "storage/block.h"
> +#include "storage/relfilenode.h"
> +#include "storage/smgr.h"
> +#include "storage/sync.h"

> Why do we need to include fmgr.h in md.h?

More generally, any massive increase in an include file's inclusions
is probably a sign that you need to refactor.  Cross-header inclusions
are best avoided altogether if you can --- obviously that's not always
possible, but we should minimize them.  We've had some very unfortunate
problems in the past from indiscriminate #includes in headers.

regards, tom lane



Re: Re: \describe*

2019-03-05 Thread Corey Huinker
>
>
> I agree with Andres and Robert.  This patch should be pushed to PG13.
>
> I'll do that on March 8 unless there is a compelling argument not to.
>
>
No objection. I'll continue to work on it, though.


Re: Rare SSL failures on eelpout

2019-03-05 Thread Tom Lane
Thomas Munro  writes:
> You can see that poll() already knew the other end had closed the
> socket.  Since this is clearly timing... let's see, yeah, I can make
> it fail every time by adding sleep(1) before the comment "Send the
> startup packet.".  I assume that'll work on any Linux machine?

Great idea, but no cigar --- doesn't do anything for me except make
the ssl test really slow.  (I tried it on RHEL6 and Fedora 28 and, just
for luck, current macOS.)  What this seems to prove is that the thing
that's different about eelpout is the particular kernel it's running,
and that that kernel has some weird timing behavior in this situation.

I've also been experimenting with reducing libpq's SO_SNDBUF setting
on the socket, with more or less the same idea of making the sending
of the startup packet slower.  No joy there either.

Annoying.  I'd be happier about writing code to fix this if I could
reproduce it :-(

regards, tom lane

PS: but now I'm wondering about trying other non-Linux kernels.



Re: Re: Optimze usage of immutable functions as relation

2019-03-05 Thread David Steele

On 2/28/19 4:27 PM, Alexander Kuzmenkov wrote:

On 2/18/19 03:20, Tom Lane wrote:

The dummy-relation stuff I referred to has now been merged, so there's
really no good reason not to revise the patch along that line.



I'll try to post the revised implementation soon.


I'll close this on March 8th if there is no new patch.

Regards,
--
-David
da...@pgmasters.net



Re: proposal: new polymorphic types - commontype and commontypearray

2019-03-05 Thread Pavel Stehule
On Tue, 5 Mar 2019 at 15:35, Tom Lane  wrote:

> David Steele  writes:
> > This thread has been very quiet for a month.  I agree with Andres [1]
> > that we should push this to PG13.
>
> I think the main thing it's blocked on is disagreement on what the
> type name should be, which is kind of a silly thing to get blocked on,
> but nonetheless it's important ...
>

I sent some other possible names, but that mail was probably forgotten.

What about "ctype" as a shortcut for "common type"? carraytype, cnonarraytype?

Regards

Pavel

>
> regards, tom lane
>


Re: insensitive collations

2019-03-05 Thread Daniel Verite
Peter Eisentraut wrote:

> Older ICU versions (<54) don't support all the locale customization
> options, so many of my new tests in collate.icu.utf8.sql will fail on
> older systems.  What should we do about that?  Have another extra test file?

Maybe stick to the old-style syntax for the regression tests?
The declarations that won't work as expected with older ICU versions
would be:

CREATE COLLATION case_insensitive (provider = icu, locale =
'und-u-ks-level2', deterministic = false);

'und-u-ks-level2' is equivalent to 'und@colStrength=secondary'

CREATE COLLATION ignore_accents (provider = icu, locale =
'und-u-ks-level1-kc-true', deterministic = false);

'und-u-ks-level1-kc-true' => 'und@colStrength=primary;colCaseLevel=yes'


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Inheriting table AMs for partitioned tables

2019-03-05 Thread Andres Freund
On 2019-03-04 22:08:04 -0800, Andres Freund wrote:
> Hi,
> 
> On 2019-03-05 16:01:50 +1300, David Rowley wrote:
> > On Tue, 5 Mar 2019 at 12:47, Andres Freund  wrote:
> > > CREATE TABLE tableam_parted_heap2 (a text, b int) PARTITION BY list (a) 
> > > USING heap2;
> > >
> > > SET default_table_access_method = 'heap';
> > > CREATE TABLE tableam_parted_a_heap2 PARTITION OF tableam_parted_heap2 FOR 
> > > VALUES IN ('a');
> > 
> > 
> > > But for tableam_parted_a_heap2 tableam_parted_b_heap2 the answer isn't
> > > quite as clear.  I think it'd both be sensible for new partitions to
> > > inherit the AM from the root, but it'd also be sensible to use the
> > > current default.
> > 
> > I'd suggest it's made to work the same way as ca4103025dfe26 made
> > tablespaces work.
> 
> Hm, is that actually correct?  Because as far as I can tell that doesn't
> have the necessary pg_dump code to make this behaviour persistent:
> 
> CREATE TABLESPACE frak LOCATION '/tmp/frak';
> CREATE TABLE test_tablespace (a text, b int) PARTITION BY list (a) TABLESPACE 
> frak ;
> CREATE TABLE test_tablespace_1 PARTITION OF test_tablespace FOR VALUES in 
> ('a');
> CREATE TABLE test_tablespace_2 PARTITION OF test_tablespace FOR VALUES in 
> ('b') TABLESPACE pg_default;
> CREATE TABLE test_tablespace_3 PARTITION OF test_tablespace FOR VALUES in 
> ('c') TABLESPACE frak;
> 
> SELECT relname, relkind, reltablespace FROM pg_class WHERE relname LIKE 
> 'test_tablespace%' ORDER BY 1;
> ┌───┬─┬───┐
> │  relname  │ relkind │ reltablespace │
> ├───┼─┼───┤
> │ test_tablespace   │ p   │ 16384 │
> │ test_tablespace_1 │ r   │ 16384 │
> │ test_tablespace_2 │ r   │ 0 │
> │ test_tablespace_3 │ r   │ 16384 │
> └───┴─┴───┘
> (4 rows)
> 
> but a dump outputs (abbreviated)
> 
> SET default_tablespace = frak;
> CREATE TABLE public.test_tablespace (
> a text,
> b integer
> )
> PARTITION BY LIST (a);
> CREATE TABLE public.test_tablespace_1 PARTITION OF public.test_tablespace
> FOR VALUES IN ('a');
> SET default_tablespace = '';
> CREATE TABLE public.test_tablespace_2 PARTITION OF public.test_tablespace
> FOR VALUES IN ('b');
> SET default_tablespace = frak;
> CREATE TABLE public.test_tablespace_3 PARTITION OF public.test_tablespace
> FOR VALUES IN ('c');
> 
> which restores to:
> 
> postgres[32125][1]=# SELECT relname, relkind, reltablespace FROM pg_class 
> WHERE relname LIKE 'test_tablespace%' ORDER BY 1;
> ┌───┬─┬───┐
> │  relname  │ relkind │ reltablespace │
> ├───┼─┼───┤
> │ test_tablespace   │ p   │ 16384 │
> │ test_tablespace_1 │ r   │ 16384 │
> │ test_tablespace_2 │ r   │ 16384 │
> │ test_tablespace_3 │ r   │ 16384 │
> └───┴─┴───┘
> (4 rows)
> 
> because public.test_tablespace_2 assumes it ought to inherit the
> tablespace from the partitioned table.
> 
> 
> I also find it far from clear that:
> 
>  
>   The tablespace_name is the name
>   of the tablespace in which the new table is to be created.
>   If not specified, default_tablespace is consulted, or
>   temp_tablespaces if the table is temporary.  For
>   partitioned tables, since no storage is required for the table itself,
>   the tablespace specified here only serves to mark the default tablespace
>   for any newly created partitions when no other tablespace is explicitly
>   specified.
>  
> 
> is handled correctly. The above says that the *specified* tablespaces -
> which seems to exclude the default tablespace - is what's going to
> determine what partitions use as their default tablespace. But in fact
> that's not true, the partitioned table's pg_class.reltablespace is set to
> what default_tablespace was at the time of creation.
> 
> 
> > i.e. if they specify the storage type when creating
> > the partition, then always use that, unless they mention otherwise. If
> > nothing was mentioned when they created the partition, then use
> > default_table_access_method.
> 
> Hm. That'd be doable, but given the above ambiguities I'm not convinced
> that's the best approach.  As far as I can see that'd require:
> 
> 1) At relation creation, for partitioned tables only, do not take
>default_table_access_method into account.
> 
> 2) At partition creation, if the AM is not specified and if the
>partitioned table's relam is 0, use the default_table_access_method.
> 
> 3) At pg_dump, for partitioned tables only, explicitly emit a USING
>... rather than use the method of manipulating default_table_access_method.
> 
> As far as I can tell, the necessary steps are also what'd need to be
> done to actually implement the described behaviour for TABLESPACE (with
> s/default_table_access_method/default_tablespace/ and s/USING/TABLESPA

Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 6:07 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > You can see that poll() already knew the other end had closed the
> > socket.  Since this is clearly timing... let's see, yeah, I can make
> > it fail every time by adding sleep(1) before the comment "Send the
> > startup packet.".  I assume that'll work on any Linux machine?
>
> Great idea, but no cigar --- doesn't do anything for me except make
> the ssl test really slow.  (I tried it on RHEL6 and Fedora 28 and, just
> for luck, current macOS.)  What this seems to prove is that the thing
> that's different about eelpout is the particular kernel it's running,
> and that that kernel has some weird timing behavior in this situation.
>
> I've also been experimenting with reducing libpq's SO_SNDBUF setting
> on the socket, with more or less the same idea of making the sending
> of the startup packet slower.  No joy there either.
>
> Annoying.  I'd be happier about writing code to fix this if I could
> reproduce it :-(

Hmm.  Note that eelpout only started doing it with OpenSSL 1.1.1.  But
I just tried the sleep(1) trick on an x86 box running the same version
of Debian, OpenSSL etc and it didn't work.  So eelpout (a super cheap
virtualised 4-core ARMv8 system rented from scaleway.com running
Debian Buster with a kernel identifying itself as 4.9.23-std-1 and
libc6 2.28-7) is indeed starting to look pretty weird.  Let me know if
you want to log in and experiment on that machine.

-- 
Thomas Munro
https://enterprisedb.com



Re: Ordered Partitioned Table Scans

2019-03-05 Thread Robert Haas
On Wed, Dec 19, 2018 at 5:08 PM David Rowley
 wrote:
> With my idea for using live_parts, we'll process the partitions
> looking for interleaved values on each query, after pruning takes
> place. In this case, we'll see the partitions are naturally ordered. I
> don't really foresee any issues with that additional processing since
> it will only be a big effort when there are a large number of
> partitions, and in those cases the planner already has lots of work to
> do. Such processing is just a drop in the ocean when compared to path
> generation for all those partitions.

I agree that partitions_are_ordered() is cheap enough in this patch
that it probably doesn't matter whether we cache the result.  On the
other hand, that's mostly because you haven't handled the hard cases -
e.g. interleaved list partitions.  If you did, then it would be
expensive, and it probably *would* be worth caching the result.  Now
maybe those hard cases aren't worth handling anyway.
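
For concreteness, here's a minimal sketch of the sort of interleaved case I
mean, with made-up names:

CREATE TABLE t (a int) PARTITION BY LIST (a);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES IN (1, 3);
CREATE TABLE t_p2 PARTITION OF t FOR VALUES IN (2, 4);

No ordering of those two partition scans produces output sorted by a, so
proving or disproving ordering there requires real work.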

You also seem to be saying that since run-time partition pruning
might change the answer, caching the initial answer is pointless.  But
I think Julien has made a good argument for why that's wrong: if the
initial answer is that the partitions are ordered, which will often be
true, then we can skip all later checks.

So I am OK with the fact that this patch doesn't choose to cache it,
but I don't really buy any of your arguments for why it would be a bad
idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inheriting table AMs for partitioned tables

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 12:59 PM Andres Freund  wrote:
> Based on this mail I'm currently planning to simply forbid specifying
> USING for partitioned tables. Then we can argue about this later.

+1.  I actually think that might be the right thing in the long-term,
but it undeniably avoids committing to any particular decision in the
short term, which seems good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Windows 32 bit vs circle test

2019-03-05 Thread Andrew Dunstan


We don't currently have any buildfarm animals running 32 bit mingw
builds for releases > 10. As part of my testing of msys 2 I thought I
would try its 32 bit compiler and got this regression diff on HEAD


cheers


andrew


diff -w -U3
C:/tools/msys64/home/Administrator/bf/root/HEAD/pgsql/src/test/regress/expected/circle.out
C:/tools/msys64/home/Administrator/bf/root/HEAD/pgsql.build/src/test/regress/results/circle.out
---
C:/tools/msys64/home/Administrator/bf/root/HEAD/pgsql/src/test/regress/expected/circle.out
 
2019-03-03 17:14:38.648207000 +
+++
C:/tools/msys64/home/Administrator/bf/root/HEAD/pgsql.build/src/test/regress/results/circle.out

2019-03-04 19:33:37.576494900 +
@@ -111,8 +111,8 @@
   WHERE (c1.f1 < c2.f1) AND ((c1.f1 <-> c2.f1) > 0)
   ORDER BY distance, area(c1.f1), area(c2.f1);
  five |  one   |  two   | distance
---+++--
-  | <(3,5),0>  | <(1,2),3>  | 0.60555127546399
+--+++---
+  | <(3,5),0>  | <(1,2),3>  | 0.605551275463989
   | <(3,5),0>  | <(5,1),3>  | 1.47213595499958
   | <(100,200),10> | <(100,1),115>  |   74
   | <(100,200),10> | <(1,2),100>    | 111.370729772479



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: New vacuum option to do only freezing

2019-03-05 Thread Bossart, Nathan
On 3/5/19, 1:22 AM, "Masahiko Sawada"  wrote:
> On Tue, Mar 5, 2019 at 8:27 AM Bossart, Nathan  wrote:
>> +  VACUUM removes dead tuples and prunes HOT-updated
>> +  tuples chain for live tuples on table. If the table has any dead tuple
>> +  it removes them from both table and indexes for re-use. With this
>> +  option VACUUM doesn't completely remove dead tuples
>> +  and disables removing dead tuples from indexes.  This is suitable for
>> +  avoiding transaction ID wraparound (see
>> +  ) but not sufficient for 
>> avoiding
>> +  index bloat. This option is ignored if the table doesn't have index.
>> +  This cannot be used in conjunction with FULL
>> +  option.
>>
>> There are a couple of small changes I would make.  How does something
>> like this sound?
>>
>> VACUUM removes dead tuples and prunes HOT-updated tuple chains for
>> live tuples on the table.  If the table has any dead tuples, it
>> removes them from both the table and its indexes and marks the
>> corresponding line pointers as available for re-use.  With this
>> option, VACUUM still removes dead tuples from the table, but it
>> does not process any indexes, and the line pointers are marked as
>> dead instead of available for re-use.  This is suitable for
>> avoiding transaction ID wraparound (see Section 24.1.5) but not
>> sufficient for avoiding index bloat.  This option is ignored if
>> the table does not have any indexes.  This cannot be used in
>> conjunction with the FULL option.
>
> Hmm, that's a good idea, but I'm not sure that users know the term 'line
> pointer' because it is used only in the pageinspect documentation. I wonder
> if the term 'item identifier' would be more appropriate here, because it
> is used when describing the page layout (in storage.sgml). Thoughts?

That seems reasonable to me.  It seems like ItemIdData is referred to
as "item pointer," "line pointer," or "item identifier" depending on
where you're looking, but ItemPointerData is also referred to as "item
pointer."  I think using "item identifier" is appropriate here for
clarity and consistency with storage.sgml.

>> tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, 
>> false,
>> -
>> &vacrelstats->latestRemovedXid);
>> +
>> &vacrelstats->latestRemovedXid,
>> +
>> &tups_pruned);
>>
>> Why do we need a separate tups_pruned argument in heap_page_prune()?
>> Could we add the result of heap_page_prune() to tups_pruned instead,
>> then report the total number of removed tuples as tups_vacuumed +
>> tups_pruned elsewhere?
>
> Hmm, I thought that we should report only the number of tuples
> completely removed but we already count the tuples marked as
> redirected as tups_vacuumed. Let me summarize the fate of dead tuples.
> I think we can roughly classify dead tuples as follows.
>
> 1. root tuple of HOT chain that became dead
> 2. root tuple of HOT chain that became redirected
> 3. other tuples of HOT chain that became unused
> 4. tuples that became dead after HOT pruning
>
> The tuples of #1 through #3 either have only ItemIDs or have been
> completely removed, but tuples of #4 still have their tuple storage because
> they are not processed during HOT pruning.
>
> Currently tups_vacuumed counts all of them, nleft (=
> vacrelstats->num_dead_tuples) counts #1 + #4. I think that the number
> of removed tuples being reported would be #1 + #2 + #3. Or should we
> use  #2 + #3 instead?

I think I'm actually more in favor of what was in v6.  IIRC that
version of the patch didn't modify how we tracked the "removed" tuples
at all, but it just added the "X item identifiers left marked dead"
metric.  Since even the tuples we are leaving marked dead lose
storage, that seems accurate enough to me.

>> Another interesting thing I noticed is that this "removed X row
>> versions" message is only emitted if vacuumed_pages is greater than 0.
>> However, if we only did HOT pruning, tups_vacuumed will be greater
>> than 0 while vacuumed_pages will still be 0, so some information will
>> be left out.  I think this is already the case, though, so this could
>> probably be handled in a separate thread.
>>
>
> Hmm, since this log message is corresponding to the one that
> lazy_vacuum_heap makes and total number of removed tuples are always
> reported, it seems consistent to me. Do you have another point?

Here's an example:

postgres=# CREATE TABLE test (a INT, b INT);
CREATE TABLE
postgres=# CREATE INDEX ON test (a);
CREATE INDEX
postgres=# INSERT INTO test VALUES (1, 2);
INSERT 0 1

After only HOT updates, the "removed X row versions in Y pages"
message is not emitted:

postgres=# UPDATE test SET b = 3;
UPD

Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 7:05 AM Thomas Munro  wrote:
> On Wed, Mar 6, 2019 at 6:07 AM Tom Lane  wrote:
> > Annoying.  I'd be happier about writing code to fix this if I could
> > reproduce it :-(
>
> Hmm.  Note that eelpout only started doing it with OpenSSL 1.1.1.

Bleugh.  I think this OpenSSL package might just be buggy on ARM.  On
x86, apparently the same version of OpenSSL and all other details of
the test the same, I can see that SSL_connect() returns <= 0
(failure), and then we ask for that cert revoked message directly and
never even reach the startup packet sending code.  On ARM,
SSL_connect() returns 1 (success) and then we proceed as discussed and
eventually get the error later (or not).  So I think I should figure
out a minimal repro and report this to them.

-- 
Thomas Munro
https://enterprisedb.com



Re: any plan to support shared servers like Oracle in PG?

2019-03-05 Thread legrand legrand
There are already solutions for this feature in Postgres,
under the "connection pooler" wording.

see 

pgpool: http://www.pgpool.net/mediawiki/index.php/Main_Page

pgbouncer: https://pgbouncer.github.io/

there are also discussions to include this as a core feature

https://www.postgresql.org/message-id/flat/4b971a8f-ff61-40eb-8f30-7b57eb0fdf9d%40postgrespro.ru



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-05 Thread Sergei Kornilov
Hello, David!

> I've made a pass over v10. I think it's in pretty good shape, but I
> did end up changing a few small things.

Thank you! I merged your changes to new patch version.

> The only thing that I'm a bit unsure of is the tests. I've read the
> thread and I see the discussion above about it. I'd personally have
> thought INFO was fine since ATTACH PARTITION does that, but I see
> objections.

It is unclear to me whether we have consensus about the INFO ereport in ATTACH
PARTITION.

> It appears that all the tests just assume that the CHECK
> constraint was used to validate the SET NOT NULL. I'm not all that
> sure if there's any good reason not to set client_min_messages =
> 'debug1' just before your test then RESET client_min_messages;
> afterwards. No other tests seem to do it, but my only thoughts on the
> reason not to would be that it might fail if someone added another
> debug somewhere, but so what? they should update the expected results
> too.

I'm not sure we have consensus about debug messages in tests, so I put those
test changes in an additional patch.

> separate that part out and
> put back the function named PartConstraintImpliedByRelConstraint and
> have it do the IS NOT NULL building before calling
> ConstraintImpliedByRelConstraint(). No duplicate code that way and you
> also don't need to touch partbound.c

In this case we need an extra argument to ConstraintImpliedByRelConstraint for
a caller-provided existConstraint, right? Along with the Relation itself? Then I
need to make a copy of existConstraint, append the relation constraints and call
predicate_implied_by. If I understood correctly, the pseudocode would be:

PartConstraintImpliedByRelConstraint(rel, testConstraint)
  generate notNullConstraint from NOT NULL table attributes
  call ConstraintImpliedByRelConstraint(rel, testConstraint, notNullConstraint)

ConstraintImpliedByRelConstraint(rel, testConstraint, existsConstraint)
  copy existsConstraint to localExistsConstraint variable
  append relation constraints to localExistsConstraint list
  call predicate_implied_by
  
I think the redundant IS NOT NULL check is not an issue here and we do not need
extra arguments for ConstraintImpliedByRelConstraint. This was not changed in
the attached version, but I will change it if you think that would be a better
design.
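
For reference, the user-visible behaviour described by the attached
documentation change is roughly this (a sketch with made-up names):

CREATE TABLE t (a int);
ALTER TABLE t ADD CONSTRAINT t_a_check CHECK (a IS NOT NULL);
-- the CHECK constraint proves no NULL can exist, so the full table scan
-- is skipped here:
ALTER TABLE t ALTER COLUMN a SET NOT NULL;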

regards, Sergei

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 890b23afd6..926b3361ea 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -214,8 +214,17 @@ WITH ( MODULUS numeric_literal, REM
 
  
   These forms change whether a column is marked to allow null
-  values or to reject null values.  You can only use SET
-  NOT NULL when the column contains no null values.
+  values or to reject null values.
+ 
+
+ 
+  SET NOT NULL may only be applied to a column
+  providing none of the records in the table contain a
+  NULL value for the column.  Ordinarily this is
+  checked during the ALTER TABLE by scanning the
+  entire table, however if a valid CHECK constraint is
+  found which proves no NULL can exist, then the
+  table scan is skipped.
  
 
  
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index a93b13c2fe..b03e1bc875 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -158,7 +158,7 @@ typedef struct AlteredTableInfo
 	/* Information saved by Phases 1/2 for Phase 3: */
 	List	   *constraints;	/* List of NewConstraint */
 	List	   *newvals;		/* List of NewColumnValue */
-	bool		new_notnull;	/* T if we added new NOT NULL constraints */
+	bool		verify_new_notnull; /* T if we should recheck NOT NULL */
 	int			rewrite;		/* Reason for forced rewrite, if any */
 	Oid			newTableSpace;	/* new tablespace; 0 means no change */
 	bool		chgPersistence; /* T if SET LOGGED/UNLOGGED is used */
@@ -370,6 +370,7 @@ static ObjectAddress ATExecDropNotNull(Relation rel, const char *colName, LOCKMO
 static void ATPrepSetNotNull(Relation rel, bool recurse, bool recursing);
 static ObjectAddress ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
  const char *colName, LOCKMODE lockmode);
+static bool NotNullImpliedByRelConstraints(Relation rel, Form_pg_attribute attr);
 static ObjectAddress ATExecColumnDefault(Relation rel, const char *colName,
 	Node *newDefault, LOCKMODE lockmode);
 static ObjectAddress ATExecAddIdentity(Relation rel, const char *colName,
@@ -4506,10 +4507,11 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
 		else
 		{
 			/*
-			 * Test the current data within the table against new constraints
-			 * generated by ALTER TABLE commands, but don't rebuild data.
+			 * If required, test the current data within the table against new
+			 * constraints generated by ALTER TABLE commands, but don't rebuild
+			 * data.
 			 */
-			if (tab->constraints != NIL || tab->new_notnull ||
+			if (tab->constraints != N

Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Paul Ramsey



> On Mar 4, 2019, at 4:22 PM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>>> On Mar 4, 2019, at 2:52 PM, Tom Lane  wrote:
>>> BTW, if you'd like me to review the code you added for this, I'd be happy
>>> to do so.  I've never looked at PostGIS' innards, but probably I can make
>>> sense of the code for this despite that.
> 
>> I would be ecstatic for a review, I'm sure I've left a million loose threads 
>> dangling.
> 
> I took a look, and saw that you'd neglected to check pseudoconstantness
> of the non-index argument, so this'd fail on cases like ST_DWithin(x, y)
> where x is indexed and y is another column in the same table.  Also
> I thought the handling of commutation could be done better.  Attached is
> a suggested patch atop your f731c1b7022381dbf627cae311c3d37791bf40c3 to
> fix those and a couple of nitpicky other things.  (I haven't tested this,
> mind you.)
> 
> One thing that makes me itch, but I didn't address in the attached,
> is that expandFunctionOid() is looking up a function by name without
> any schema-qualification.  That'll fail outright if PostGIS isn't in
> the search path, and even if it is, you've got security issues there.
> One way to address this is to assume that the expandfn is in the same
> schema as the ST_XXX function you're attached to, so you could
> do "get_namespace_name(get_func_namespace(funcid))" and then include
> that in the list passed to LookupFuncName.

Thanks for the patch, I’ve applied and smoothed and taken your advice on 
schema-qualified lookups as well.

> Also, this might be as-intended but I was wondering: I'd sort of expected
> you to make, eg, _ST_DWithin() and ST_DWithin() into exact synonyms.
> They aren't, since the former is not connected to the support function.
> Is that intentional?  I guess if you had a situation where you wanted to
> force non-use of an index, being able to use _ST_DWithin() for that would
> be helpful.

Yes, this is by design. Other parts of the internal code base still like access 
to _ST_Functions, and there’s a non-zero chance that some 3rd party callers 
still want them too. There’s a certain utility in having “guaranteed not 
indexed” things so you can combine them with your other indexes of choice 
(particularly given the insane zoo of indexes built against geometry).

Again, many many thanks for your help! Next stop, costing.

P.




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-05 Thread Peter Geoghegan
On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas  wrote:
> I'm looking at the first patch in the series now. I'd suggest that you
> commit that very soon. It's useful on its own, and seems pretty much
> ready to be committed already. I don't think it will be much affected by
> whatever changes we make to the later patches, anymore.

I agree that the parts covered by the first patch in the series are
very unlikely to need changes, but I hesitate to commit it weeks ahead
of the other patches. Some of the things that make _bt_findinsertloc()
fast are missing for v3 indexes. The "consider secondary factors
during nbtree splits" patch actually more than compensates for that
with v3 indexes, at least in some cases, but the first patch applied
on its own will slightly regress performance. At least, I benchmarked
the first patch on its own several months ago and noticed a small
regression at the time, though I don't have the exact details at hand.
It might have been an invalid result, because I wasn't particularly
thorough at the time.

We do make some gains in the first patch  (the _bt_check_unique()
thing), but we also check the high key more than we need to within
_bt_findinsertloc() for non-unique indexes. Plus, the microvacuuming
thing isn't as streamlined.

It's a lot of work to validate and revalidate the performance of a
patch like this, and I'd rather commit the first three patches within
a couple of days of each other (I can validate v3 indexes and v4
indexes separately, though). We can put off the other patches for
longer, and treat them as independent. I guess I'd also push the final
amcheck patch following the first three -- no point in holding back on
that. Then we'd be left with "Add "split after new tuple"
optimization", and "Add high key "continuescan" optimization" as
independent improvements that can be pushed at the last minute of the
final CF.

> I also had a few comments and questions on some details. I added them in
> the same patch, marked with "HEIKKI:". Please take a look.

Will respond now. Any point that I haven't responding to directly has
been accepted.

> +HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
> +weird to talk about it in the function comment. I didn't understand what
> +the point of adding this sentence was, so I removed it.

Maybe there is no point in the comment you reference here, but I like
the idea of "checkingunique", because that symbol name is a common
thread between a number of functions that coordinate with each other.
It's not just a local variable in one function.

> @@ -588,6 +592,17 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
> if (P_RIGHTMOST(opaque))
> break;
> highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
> +
> +   /*
> +* HEIKKI: This assertion might fire if the user-defined opclass
> +* is broken. It's just an assertion, so maybe that's ok. With a
> +* broken opclass, it's obviously "garbage in, garbage out", but
> +* we should try to behave sanely anyway. I don't remember what
> +* our general policy on that is; should we assert, elog(ERROR),
> +* or continue silently in that case? An elog(ERROR) or
> +* elog(WARNING) would feel best to me, but I don't remember what
> +* we usually do.
> +*/
> Assert(highkeycmp <= 0);
> if (highkeycmp != 0)
> break;

We don't really have a general policy on it. However, I don't have any
sympathy for the idea of trying to solider on with a corrupt index. I
also don't think that it's worth making this a "can't happen" error.
Like many of my assertions, this assertion is intended to document an
invariant. I don't actually anticipate that it could ever really fail.

> +Should we mention explicitly that this binary-search reuse is only applicable
> +if unique checks were performed? It's kind of implied by the fact that it's
> +_bt_check_unique() that saves the state, but perhaps we should be more clear
> +about it.

I guess so.

> +What is a "garbage duplicate"? Same as a "dead duplicate"?

Yes.

> +The last sentence, about garbage duplicates, seems really vague. Why do we
> +ever do any comparisons that are not strictly necessary? Perhaps it's best to
> +just remove that last sentence.

Okay -- will remove.

> +
> +HEIKKI: I don't buy the argument that microvacuuming has to happen here. You
> +could easily imagine a separate function that does microvacuuming, and resets
> +(or even updates) the binary-search cache in the insertion key. I agree this
> +is a convenient place to do it, though.

It wasn't supposed to be a water-tight argument. I'll just say that
it's convenient.

> +/* HEIKKI:
> +Do we need 'checkunique' as an argument? If unique checks were not
> +performed, the insertion key will simply not have saved state.
> +*/

We need it in the next patch in the series, because i

Re: Rare SSL failures on eelpout

2019-03-05 Thread Tom Lane
Thomas Munro  writes:
> Bleugh.  I think this OpenSSL package might just be buggy on ARM.  On
> x86, apparently the same version of OpenSSL and all other details of
> the test the same, I can see that SSL_connect() returns <= 0
> (failure), and then we ask for that cert revoked message directly and
> never even reach the startup packet sending code.  On ARM,
> SSL_connect() returns 1 (success) and then we proceed as discussed and
> eventually get the error later (or not).  So I think I should figure
> out a minimal repro and report this to them.

Yeah, I've still been unable to reproduce even with the sleep idea,
so eelpout is definitely looking like a special snowflake from here.
In any case, there seems little doubt that getting past SSL_connect()
when the cert check has failed is an OpenSSL bug; I don't feel a need
to create a workaround for it.

The bug #15598 report is more troublesome, as we don't have a strong
reason to believe it's not common on Windows.  However, I wonder whether
we can really do anything at all about that one.  If I understand what
Andrew was hypothesizing in that thread, it was that Windows might be
dropping undelivered data on the floor once the server closes its end
of the connection.  That would be totally broken behavior, but I never
expect anything else from Microsoft :-(.  If that is an accurate theory
then rewriting libpq won't fix it.

I still think the redesign I suggested upthread would make things cleaner,
but I don't have time or interest to make it happen in the near future
if it's not fixing an observable bug.

regards, tom lane



Fwd: [Issue] Can't recompile cube extension as PGXS, utils/float.h is not installed

2019-03-05 Thread Siarhei Siniak
-- Forwarded message -
From: Siarhei Siniak 
Date: Tue, 5 Mar 2019 at 23:31
Subject: Re: [Issue] Can't recompile cube extension as PGXS, utils/float.h
is not installed
To: Tom Lane 


>AFAICT, that file only exists in HEAD, not in any released branch, and
>it is installed during "make install" from HEAD.  Please be sure you
>are using installed files that match whatever branch you're trying
>to build from.
Yeah, June and July 2018, a month in between. I just thought the release was
not so long ago.
```
git log  REL_11_BETA2..6bf0bc842bd75 --format=oneline -- | wc -l
# 175
```
Ok, then probably no more questions.


Re: bgwriter_lru_maxpages limits in PG 10 sample conf

2019-03-05 Thread Bruce Momjian
On Tue, Mar  5, 2019 at 12:24:14AM -0300, Alvaro Herrera wrote:
> On 2019-Mar-04, Bruce Momjian wrote:
> 
> > On Thu, Feb 28, 2019 at 10:28:44AM +0300, Sergei Kornilov wrote:
> > > Hello
> > > 
> > > postgresql.conf.sample was changed recently in REL_10_STABLE (commit 
> > > ab1d9f066aee4f9b81abde6136771debe0191ae8)
> > > So config will be changed in next minor release anyway. We have another 
> > > reason to not fix bgwriter_lru_maxpages comment?
> > 
> > Frankly, I was surprised postgresql.conf.sample was changed in a back
> > branch since it will cause people who diff $PGDATA/postgresql.conf with
> > share/postgresql.conf.sample to see differences they didn't make.
> 
> I think the set of people that execute diffs of their production conf
> file against the sample file to be pretty small -- maybe even empty.  If
> you're really interested in knowing the changes you've done, you're more
> likely to use a version control system on the file anyway.

Well, if this is true, then we should all agree to backpatch to
postgresql.conf.sample more often.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2019-03-05 Thread Robert Haas
On Thu, Feb 28, 2019 at 3:27 PM Robert Haas  wrote:
> I'm not currently aware of any remaining correctness issues with this
> code, although certainly there may be some.  There has been a certain
> dearth of volunteers to review any of this.  I do plan to poke at it a
> bit to see whether it has any significant performance impact, but not
> today.

Today, did some performance testing.  I created a table with 100
partitions and randomly selected rows from it using pgbench, with and
without -M prepared.  The results show a small regression, but I
believe it's below the noise floor.  Five minute test runs.
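
(A sketch of that kind of setup; the names and the hash partitioning scheme
here are assumptions for illustration, not the exact test schema:)

CREATE TABLE pt (a int, b text) PARTITION BY HASH (a);
CREATE TABLE pt_0 PARTITION OF pt FOR VALUES WITH (MODULUS 100, REMAINDER 0);
-- ... and so on for remainders 1 through 99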

with prepared queries

master:
tps = 10919.914458 (including connections establishing)
tps = 10876.271217 (including connections establishing)
tps = 10761.586160 (including connections establishing)

concurrent-attach:
tps = 10883.535582 (including connections establishing)
tps = 10868.471805 (including connections establishing)
tps = 10761.586160 (including connections establishing)

with simple queries

master:
tps = 1486.120530 (including connections establishing)
tps = 1486.797251 (including connections establishing)
tps = 1494.129256 (including connections establishing)

concurrent-attach:
tps = 1481.774212 (including connections establishing)
tps = 1472.159016 (including connections establishing)
tps = 1476.444097 (including connections establishing)

Looking at the total of the three results, that's about an 0.8%
regression with simple queries and an 0.2% regression with prepared
queries.  Looking at the median, it's about 0.7% and 0.07%.  Would
anybody like to argue that's a reason not to commit these patches?

Would anyone like to argue that there is any other reason not to
commit these patches?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Patch to document base64 encoding

2019-03-05 Thread Fabien COELHO



Hello Karl,


Attached: doc_base64_v3.patch


I'm ok with referencing the historical MIME RFC.

"RFC2045 section 6.8" -> "RFC 2045 Section 6.8"

you can link to the RFC directly with:

<ulink url="https://tools.ietf.org/html/rfc2045#section-6.8">RFC 2045
Section 6.8</ulink>


--
Fabien.



Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 9:21 AM Tom Lane  wrote:
> The bug #15598 report is more troublesome, as we don't have a strong
> reason to believe it's not common on Windows.  However, I wonder whether
> we can really do anything at all about that one.  If I understand what
> Andrew was hypothesizing in that thread, it was that Windows might be
> dropping undelivered data on the floor once the server closes its end
> of the connection.  That would be totally broken behavior, but I never
> expect anything else from Microsoft :-(.  If that is an accurate theory
> then rewriting libpq won't fix it.

Here is a stupid Python 2.7 program to try to test that.  Run one copy
of it like this:

$ python ./test.py --server

The server will wait for a client, send a message immediately, and
then close the socket after a second.  The client will connect and
send something once before and twice after the server closes the
socket, and finally see if it can read the message from the server.

Here's the output I get from the client on some different systems (the
server was running on the same system):

$ uname -a
Linux debian 4.18.0-3-amd64 #1 SMP Debian 4.18.20-2 (2018-11-23)
x86_64 GNU/Linux
$ python ./test.py --client
Sending A...
2
Sending B...
[Errno 104] Connection reset by peer
Sending C...
[Errno 32] Broken pipe
This is the server saying goodbye

$ uname -a
Darwin macaque.local 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20
20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64
$ python2.7 ./test.py --client
Sending A...
2
Sending B...
[Errno 32] Broken pipe
Sending C...
[Errno 32] Broken pipe
This is the server saying goodbye

$ uname -a
FreeBSD dogmatix 13.0-CURRENT FreeBSD 13.0-CURRENT c0873ea614a(master)
GENERIC  amd64
$ python2.7 ./test.py --client
Sending A...
2
Sending B...
2
Sending C...
2
This is the server saying goodbye

So... can anyone tell us what happens on Windows?

(A secondary question might be what happens if the server and client
are on different machines since I guess it could be different?)

-- 
Thomas Munro
https://enterprisedb.com
import socket
import sys
import time

address = ("localhost", )

def server():
  ss = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  ss.bind(address)
  ss.listen(5)
  (s, other_address) = ss.accept()
  # Send a message, but then close the socket shortly after.
  # Will the client manage to read the goodbye message, if it tries to write
  # first?
  s.send("This is the server saying goodbye\n")
  time.sleep(1)
  s.close()

def client():
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect(address)
  # Try to send a message immediately.  At this point we expect the server
  # end to be waiting for a second before closing, so it should be OK.
  try:
    print "Sending A..."
    sent = s.send("A\n")
    print sent
  except Exception, e:
    # (Should not be reached)
    print e
  # Wait until after the server has closed its socket, and try to send
  # another message.
  time.sleep(2)
  try:
    print "Sending B..."
    sent = s.send("B\n")
    print sent
  except Exception, e:
    # We expect an error either here or at the next write (?)
    print e
  # Send one more message, just to see if perhaps an error will be reported
  # here rather than earlier...
  try:
    print "Sending C..."
    sent = s.send("C\n")
    print sent
  except Exception, e:
    # What about now?
    print e
  # Can we read the goodbye message?
  print s.recv(1024)

if __name__ == "__main__":
  if sys.argv[1] == "--server":
    server()
  elif sys.argv[1] == "--client":
    client()


Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread Andrew Dunstan


On 2/25/19 8:38 AM, David Rowley wrote:
> On Tue, 26 Feb 2019 at 02:06, Joe Conway  wrote:
>> On 2/25/19 1:17 AM, Peter Geoghegan wrote:
>>> On Sun, Feb 24, 2019 at 9:42 PM David Rowley
>>>  wrote:
 The current default vacuum_cost_limit of 200 seems to be 15 years old
 and was added in f425b605f4e.

 Any supporters for raising the default?
>>> I also think that the current default limit is far too conservative.
>> I agree entirely. In my experience you are usually much better off if
>> vacuum finishes quickly. Personally I think our default scale factors
>> are horrible too, especially when there are tables with large numbers of
>> rows.
> Agreed that the scale factors are not perfect, but I don't think
> changing them is as quite a no-brainer as the vacuum_cost_limit, so
> the attached patch just does the vacuum_cost_limit.
>
> I decided to do the times by 10 option that I had mentioned. Ensue
> debate about that...
>
> I'll add this to the March commitfest and set the target version as PG12.
>

This patch is tiny, seems perfectly reasonable, and has plenty of
support. I'm going to commit it shortly unless there are last minute
objections.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread Andres Freund
On 2019-03-05 17:14:55 -0500, Andrew Dunstan wrote:
> This patch is tiny, seems perfectly reasonable, and has plenty of
> support. I'm going to commit it shortly unless there are last minute
> objections.

+1



Re: A separate table level option to control compression

2019-03-05 Thread Andrew Dunstan


On 2/6/19 2:32 AM, Pavan Deolasee wrote:
> Hello,
>
> Currently either the table level option `toast_tuple_target` or the
> compile time default `TOAST_TUPLE_TARGET` is used to decide whether a
> new tuple should be compressed or not. While this works reasonably
> well for most situations, at times the user may not want to pay the
> overhead of toasting, yet take benefits of inline compression.
>
> I would like to propose a new table level option,
> compress_tuple_target, which can be set independently of
> toast_tuple_target, and is checked while deciding whether to compress
> the new tuple or not.
>
> For example,
>
> CREATE TABLE compresstest250 (a int, b text) WITH
> (compress_tuple_target = 250);
> CREATE TABLE compresstest2040 (a int, b text) WITH
> (compress_tuple_target = 2040);
>
> -- shouldn't get compressed nor toasted
> INSERT INTO compresstest250 VALUES (1, repeat('1234567890',20));
>
> -- should get compressed, but not toasted
> INSERT INTO compresstest250 VALUES (2, repeat('1234567890',30));
>
> -- shouldn't get compressed nor toasted
> INSERT INTO compresstest2040 VALUES (1, repeat('1234567890',20));
> INSERT INTO compresstest2040 VALUES (2, repeat('1234567890',30));
>
> Without this patch, the second INSERT will not compress the tuple
> since its length is less than the toast threshold. With the patch and
> after setting table level option, one can compress such tuples.
>
> The attached patch implements this idea. 
>


This is a nice idea, and I'm a bit surprised it hasn't got more
attention. The patch itself seems very simple and straightforward,
although it could probably do with having several sets of eyeballs on it.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Incomplete startup packet errors

2019-03-05 Thread Andrew Dunstan


On 3/4/19 7:42 AM, Christoph Berg wrote:
> Re: Andrew Dunstan 2019-03-04 
> <7cc6d2c1-bd87-9890-259d-36739c247...@2ndquadrant.com>
>> Looks good to me.
> +1.
>


OK, I think we have agreement on Tom's patch. Do we want to backpatch
it? It's a change in behaviour, but I find it hard to believe anyone
relies on the existence of these annoying messages, so my vote would be
to backpatch it.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: patch to allow disable of WAL recycling

2019-03-05 Thread Jerry Jelinek
Alvaro,

Thanks again for your review. I went through your proposed patch diffs and
applied most of them to my original changes. I did a few things slightly
differently since I wanted to keep to to 80 columns for the source code,
but I can revisit that if it is not an issue. I also cleaned up the
confusing wording around "allocating blocks". I ran a clean build and make
check passes. The new patch is attached.

Thanks,
Jerry


On Wed, Feb 27, 2019 at 4:12 PM Alvaro Herrera 
wrote:

> On 2019-Feb-05, Jerry Jelinek wrote:
>
> > First, since last fall, we have found another performance problem related
> > to initializing WAL files. I've described this issue in more detail
> below,
> > but in order to handle this new problem, I decided to generalize the
> patch
> > so the tunable refers to running on a Copy-On-Write filesystem instead of
> > just being specific to WAL recycling. Specifically, I renamed the GUC
> > tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> > more obvious what is being tuned and will also be more flexible if there
> > are other problems in the future which are related to running on a COW
> > filesystem. I'm happy to choose a different name for the tunable if
> people
> > don't like 'wal_cow_fs'.
>
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
>
> I'm rewording your doc addition a little bit.  Here's my proposal:
>
>
> This parameter should only be set to on when the WAL
> resides on a Copy-On-Write (COW) filesystem.
> Enabling this option adjusts behavior to take advantage of the
> filesystem characteristics (for example, recycling WAL files and
> zero-filling new WAL files are disabled).
>
> This part sounds good enough to me -- further suggestions welcome.
>
> I'm less sure about this phrase:
>
> This setting is only appropriate for filesystems which
> allocate new disk blocks on every write.
>
> Is "... which allocate new disk blocks on every write" a technique
> distinct from CoW itself?  I'm confused as to what it means, or how can
> the user tell whether they are on such a filesystem.
>
> Obviously you're thinking that ZFS is such a filesystem and everybody
> who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
> -- should they turn this option on?  Browsing the wikipedia, I find that
> Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
> I don't think either Btrfs or ReFS are realistic options to put pg_wal
> on, so let's just list the common filesystems for which users are
> supposed to enable this option ... which I think nowadays is just ZFS.
> All in all, I would replace this phrase with something like: "This
> setting should be enabled when pg_wal resides on a ZFS filesystem or
> similar." That should be weasely enough that it's clear that we expect
> users to do the homework when on unusual systems, while actively pointing
> out the most common use case.
>
> > Finally, the patch now includes bypassing the zero-fill for new WAL files
> > when wal_cow_fs is true.
>
> That makes sense.  I think all these benchmarks Tomas Vondra run are not
> valid anymore ...
>
> The attached v2 has assorted cosmetic cleanups.  If you can validate it,
> I would appreciate it.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


0001-cow-filesystem.patch
Description: Binary data


Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread David Rowley
Thanks for chipping in on this.

On Wed, 6 Mar 2019 at 01:53, Tomas Vondra  wrote:
> But on the other hand it feels a bit weird that we increase this one
> value and leave all the other (also very conservative) defaults alone.

Which others did you have in mind? Like work_mem, shared_buffers?  If
so, I mentioned in the initial post that I don't see vacuum_cost_limit
as in the same category as those.  It's not like PostgreSQL won't
start on a tiny server if vacuum_cost_limit is too high, but you will
have issues with too big a shared_buffers, for example.   I think if
we insist that this patch is a review of all the "how big is your
server" GUCs then that's raising the bar significantly and
unnecessarily for what I'm proposing here.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 08:41, Sergei Kornilov  wrote:
> In this case we need extra argument for ConstraintImpliedByRelConstraint for 
> some caller-provided existConstraint, right? Along with Relation itself? Then 
> I need make copy of existConstraint, append relation constraints and call 
> predicate_implied_by. If I understood correctly pseudocode would be:
>
> PartConstraintImpliedByRelConstraint(rel, testConstraint)
>   generate notNullConstraint from NOT NULL table attributes
>   call ConstraintImpliedByRelConstraint(rel, testConstraint, 
> notNullConstraint)
>
> ConstraintImpliedByRelConstraint(rel, testConstraint, existsConstraint)
>   copy existsConstraint to localExistsConstraint variable
>   append relation constraints to localExistsConstraint list
>   call predicate_implied_by
>
> I thing redundant IS NOT NULL check is not issue here and we not need extra 
> arguments for ConstraintImpliedByRelConstraint. Was not changed in attached 
> version, but I will change if you think this would be better design.

I was really just throwing the idea out there for review. I don't
think that I'm insisting on it.

FWIW I think you could get away without the copy of the constraint
providing it was documented that the list is modified during the
ConstraintImpliedByRelConstraint call and callers must make a copy
themselves if they need an unmodified version. Certainly,
PartConstraintImpliedByRelConstraint won't need to make a copy, so
there's probably not much reason to assume that possible future
callers will.

Providing I'm imagining it correctly, I do think the patch is slightly
cleaner with that change.

It's:
a) slightly more efficient due to not needlessly checking a bunch of
IS NOT NULLs (imagine a 1000 column table with almost all NOT NULL and
a single CHECK constraint); and
b) patch has a smaller footprint (does not modify existing callers of
PartConstraintImpliedByRelConstraint()); and
c) does not modify an existing API function.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 03:37, Tom Lane  wrote:
>
> David Steele  writes:
> > I'm not sure if I have an issue with competing patches on the same
> > thread.  I've seen that before and it can lead to a good outcome.  It
> > case, as you say, also lead to confusion.
>
> It's a bit of a shame that the cfbot will only be testing one of them
> at a time if we leave it like this.  I kind of lean towards the
> two-thread, two-CF-entry approach because of that.  The amount of
> confusion is a constant.

That sounds fine. I'll take mine elsewhere since I didn't start this thread.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Tom Lane
Paul Ramsey  writes:
> Thanks for the patch, I’ve applied and smoothed and taken your advice on 
> schema-qualified lookups as well.

Hm, I think your addition of this bit is wrong:

+/*
+* Arguments were swapped to put the index value on the
+* left, so we need the commutated operator for
+* the OpExpr
+*/
+if (swapped)
+{
+oproid = get_commutator(oproid);
+if (!OidIsValid(oproid))
 PG_RETURN_POINTER((Node *)NULL);
+}

We already did the operator lookup with the argument types in the desired
order, so this is introducing an extra swap.  The only reason it appears
to work, I suspect, is that all your index operators are self-commutators.

regards, tom lane



Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Paul Ramsey


> On Mar 5, 2019, at 3:26 PM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>> Thanks for the patch, I’ve applied and smoothed and taken your advice on 
>> schema-qualified lookups as well.
> 
> Hm, I think your addition of this bit is wrong:
> 
> +/*
> +* Arguments were swapped to put the index value on the
> +* left, so we need the commutated operator for
> +* the OpExpr
> +*/
> +if (swapped)
> +{
> +oproid = get_commutator(oproid);
> +if (!OidIsValid(oproid))
> PG_RETURN_POINTER((Node *)NULL);
> +}
> 
> We already did the operator lookup with the argument types in the desired
> order, so this is introducing an extra swap.  The only reason it appears
> to work, I suspect, is that all your index operators are self-commutators.

I was getting regression failures until I re-swapped the operator… 

  SELECT * FROM foobar WHERE ST_Within(ConstA, VarB)

Place the indexed operator in the Left, now:

  Left == VarB
  Right == ConstA
  Strategy == Within
  get_opfamily_member(opfamilyoid, Left, Right, Within)

Unless we change the strategy number when we assign the left/right we’re 
looking up an operator for “B within A”, so we’re backwards.

I feel OK about it, if for no other reason than it passes all the tests :)

P


Converting NOT IN to anti-joins during planning

2019-03-05 Thread David Rowley
Way back in [1] I proposed that we allow NOT IN subqueries to be
converted into an anti-join where the subquery cannot return any NULL
values.   As Tom pointed out to me, I had neglected to consider that
the outer side producing NULLs can cause the anti-join plan to produce
incorrect results.  The difference is that a NOT IN where the subquery
returns no results filters nothing, otherwise it filters the nulls,
plus the records that exist in the subquery.
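
To spell that out as a sketch, with hypothetical tables t1(x) and t2(y):

SELECT * FROM t1 WHERE x NOT IN (SELECT y FROM t2);
-- If t2 is empty, every t1 row is returned, including rows where x IS NULL.
-- If t2 is non-empty (and y is never NULL), rows where x IS NULL are
-- filtered out too, which a plain anti-join on x = y would not do.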

More recently over on [2], Jim and Zheng have re-proposed making
improvements in this area. Their ideas are slightly different from
mine as they propose to add an OR .. IS NULL clause to the join
condition to handle the outer side being NULL with empty subquery
problem.  Before Jim and Zheng's patch arrived I managed to fix the
known problems with my 4-year-old patch thinking it would have been
welcome, but it seems that's not the case, perhaps due to the
differing ideas we have on how this should work. At that time I didn't
think the other patch actually existed yet... oops

Anyway, I don't really want to drop my patch as I believe what it does
is correct and there's debate on the other thread about how good an
idea adding these OR clauses to the join quals is... (forces nested
loop plan (see [3])),  but it appears Jim and Zheng are fairly set on
that idea.  Hence...

I'm moving my patch here, so it can be debated without interfering
with the other work that's going on in this area.  There has also been
some review of my patch in [4], and of course, originally in [1].

The background is really:

1. Seems fine to do this transformation when there are no nulls.
2. We don't want to cost anything to decide on to do the
transformation or not, i.e do it regardless, in all possible cases
where it's valid to do so. We already do that for NOT EXISTS, no
apparent reason to think this case is any different.
3. Need to consider what planner overhead there is from doing this and
failing to do the conversion due lack of evidence for no NULLs.

I've not done #3, at least not with the latest patch.

There's already a CF entry [5] for this patch, although its targeting PG13.

The latest patch is attached.

[1] 
https://www.postgresql.org/message-id/CAApHDvqRB-iFBy68%3DdCgqS46aRep7AuN2pou4KTwL8kX9YOcTQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/1550706289606-0.p...@n3.nabble.com
[3] 
https://www.postgresql.org/message-id/CAKJS1f_ZwXtzPz6wDpBXgAVYuxforsqpc6hBw05Y6aPGcOONfA%40mail.gmail.com
[4] https://www.postgresql.org/message-id/18203.1551543939%40sss.pgh.pa.us
[5] https://commitfest.postgresql.org/22/2020/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


not_in_anti_join_v1.3.patch
Description: Binary data


Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Tom Lane
Paul Ramsey  writes:
> On Mar 5, 2019, at 3:26 PM, Tom Lane  wrote:
>> Hm, I think your addition of this bit is wrong:
>> 
>> +/*
>> +* Arguments were swapped to put the index value on the
>> +* left, so we need the commutated operator for
>> +* the OpExpr
>> +*/
>> +if (swapped)
>> +{
>> +oproid = get_commutator(oproid);
>> +if (!OidIsValid(oproid))
>> PG_RETURN_POINTER((Node *)NULL);
>> +}
>> 
>> We already did the operator lookup with the argument types in the desired
>> order, so this is introducing an extra swap.  The only reason it appears
>> to work, I suspect, is that all your index operators are self-commutators.

> I was getting regression failures until I re-swapped the operator… 
>   SELEcT * FROM foobar WHERE ST_Within(ConstA, VarB)

Ah ... so the real problem here is that *not* all of your functions
treat their first two inputs alike, and the hypothetical future
improvement I commented about is needed right now.  I should've
looked more closely at the strategies in your table; then I would've
realized the patch as I proposed it didn't work.

But this code isn't right either.  I'm surprised you're not getting
crashes --- perhaps there aren't cases where the first and second args
are of incompatible types?  Also, it's certainly wrong to be doing this
sort of swap in only one of the two code paths.

There's more than one way you could handle this, but the way that
I was vaguely imagining was to have two strategy entries in each
IndexableFunction entry, one to apply if the first function argument
is the indexable one, and the other to apply if the second function
argument is the indexable one.  If you leave the operator lookup as
I had it (using the already-swapped data types) then you'd have to
make sure that the latter set of strategy entries are written as
if the arguments get swapped before selecting the strategy, which
would be confusing perhaps :-( --- for instance, st_within would
use RTContainedByStrategyNumber in the first case but
RTContainsStrategyNumber in the second.  But otherwise you need the
separate get_commutator step, which seems like one more catalog lookup
than you really need.
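
To illustrate with the ST_Within case, using made-up column/constant names:

SELECT * FROM foobar WHERE ST_Within(geomcol, constgeom);
-- indexed argument first: the index clause wants geomcol "contained by" constgeom
SELECT * FROM foobar WHERE ST_Within(constgeom, geomcol);
-- indexed argument second: after swapping to put geomcol on the left, the
-- index clause wants geomcol "contains" constgeom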

> I feel OK about it, if for no other reason than it passes all the tests :)

Then you're at least missing adequate tests for the 3-arg functions...
3 args with the index column second will not work as this stands.

regards, tom lane



Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 12:25, David Rowley  wrote:
> That sounds fine. I'll take mine elsewhere since I didn't start this thread.

Moved to 
https://www.postgresql.org/message-id/CAKJS1f82pqjqe3WT9_xREmXyG20aOkHc-XqkKZG_yMA7JVJ3Tw%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:46 AM David Steele  wrote:
> I have marked this entry as targeting PG13 since it is too late to
> consider for PG12.

Sounds right.  As Peter said himself, this patch is WIP, so it's too
soon to consider integrating it.  This is also fairly evident from the
content of the patch, which is full of comments marked XXX and PEMOSER
that obviously need to be addressed somehow.  For all of that, I'd say
this patch is much closer to being on the right track than the old
design, even though it's clear that a lot of work remains.

Some desultory review comments:

+#define setSweepline(datum) \
+ node->sweepline = datumCopy(datum, node->datumFormat->attbyval,
node->datumFormat->attlen)
+
+#define freeSweepline() \
+ if (! node->datumFormat->attbyval) pfree(DatumGetPointer(node->sweepline))

I am quite dubious about this. Almost everywhere in the executor we
rely on slots to keep track of tuples and free memory for us. It seems
unlikely that this should be the one place where we have code that
looks completely different.  Aside from that, this seems to presume
that there is only one relevant column, which is an assumption we
probably don't want to bake too deeply into the code.
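
Roughly what I have in mind, as a sketch only --- "prev_slot", estate,
tupdesc and outerslot here are placeholders, not names from your patch:

/* initialize once, e.g. in the node's ExecInit function (sketch only) */
node->prev_slot = ExecInitExtraTupleSlot(estate, tupdesc, &TTSOpsMinimalTuple);

/* remember the current row; the slot copies it and owns the memory */
ExecCopySlot(node->prev_slot, outerslot);

/* when the remembered row is no longer needed */
ExecClearTuple(node->prev_slot);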

+ ereport(ERROR,
+ (errcode(ERRCODE_NOT_NULL_VIOLATION),
+ errmsg("Attribute \"%s\" at position %d is null. Temporal " \
+ "adjustment not possible.",
+ NameStr(TupleDescAttr(slot->tts_tupleDescriptor, attnum - 1)->attname),
+ attnum)));

This error message is marked as translatable (ereport used rather than
elog) but the error-message text is unsuitable for a user-facing
error.  If this is a user-facing error, we need a better error, or
maybe we need to rethink the semantics altogether (e.g. just skip such
rows instead of erroring out, or something).  If it's an internal
error that should not be possible no matter what query the user
enters, and is only here as a sanity test, just simplify and use elog
(and maybe add some comments explaining why that's so).
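
For illustration, the two styles would look something like this (the
message wording is only an example):

/* user-facing error: ereport with errcode, translatable, terse message */
ereport(ERROR,
		(errcode(ERRCODE_NOT_NULL_VIOLATION),
		 errmsg("null value in column \"%s\" is not allowed for temporal adjustment",
				NameStr(TupleDescAttr(slot->tts_tupleDescriptor,
									  attnum - 1)->attname))));

/* internal can't-happen condition: plain elog, not translated */
elog(ERROR, "unexpected null value in attribute %d during temporal adjustment",
	 attnum);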

+ * heapGetAttrNotNull

I may be a bit behind the times here, but it seems to me that this is
functionally equivalent to slotGetAttrNotNull and thus we shouldn't
need both.

+	bool		empty = false;

Not much point in declaring a variable whose value is never changed
and whose value takes up exactly the same number of characters as the
variable name.

+ * temporalAdjustmentStoreTuple
+ *  While we store result tuples, we must add the newly calculated temporal
+ *  boundaries as two scalar fields or create a single range-typed field
+ *  with the two given boundaries.

This doesn't seem to describe what the function is actually doing.

+ * This should ideally be done with RangeBound types on the right-hand-side
+ * created during range_split execution. Otherwise, we lose information about
+ * inclusive/exclusive bounds and infinity. We would need to implement btree
+ * operators for RangeBounds.

This seems like an idea for future improvement, but it's unclear to me
how the proposed idea is different from the state created by the
patch.

Also, materializing the slot to a heap tuple so that we can modify it
seems inefficient.  I wonder if we can't use a virtual slot here.
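
Something like the following, perhaps --- a very rough sketch in which
natts, the attribute numbers, and the new boundary Datums are all
placeholders:

/* deform the input tuple into the virtual result slot (sketch only) */
slot_getallattrs(slot);
ExecClearTuple(resultslot);
memcpy(resultslot->tts_values, slot->tts_values, natts * sizeof(Datum));
memcpy(resultslot->tts_isnull, slot->tts_isnull, natts * sizeof(bool));

/* overwrite just the temporal boundary columns */
resultslot->tts_values[lower_attno - 1] = new_lower_bound;
resultslot->tts_isnull[lower_attno - 1] = false;
resultslot->tts_values[upper_attno - 1] = new_upper_bound;
resultslot->tts_isnull[upper_attno - 1] = false;

ExecStoreVirtualTuple(resultslot);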

+ if (qual->opno == OID_RANGE_EQ_OP) {
+ Oid rngtypid;
+
+ // XXX PEMOSER Change opfamily and opfunc
+ qual->opfuncid = F_RANGE_CONTAINS; //<<--- opfuncid can be 0 during planning
+ qual->opno = OID_RANGE_CONTAINS_ELEM_OP; //OID_RANGE_CONTAINS_OP;
+ clause->isnormalize = true;
+
+ // Attention: cannot merge using non-equality operator 3890 <---
OID_RANGE_CONTAINS_OP
+ opfamily = 4103; //range_inclusion_ops from pg_opfamily.h
+
+ rngtypid = exprType((Node*)clause->lexpr->expr);
+ clause->range_typcache = lookup_type_cache(rngtypid, TYPECACHE_RANGE_INFO);
+ testmytypcache = clause->range_typcache;
+ } else {
+ clause->isnormalize = false;
+ }

This is pretty obviously a mess of hardcoded constants which are,
furthermore, not explained.  I can't tell whether this is intended as
a dirty hack to make some cases work while other things remain broken,
or whether you believe this treatment of OID_RANGE_EQ_OP is actually
correct.  If it's the latter, this needs a much better explanation.
You probably realize that
"Attention: cannot merge using non-equality operator 3890" is not a
compelling justification, and that hard-coding all of these things
here doesn't look good.

In general, this patch needs both user-facing documentation and more
and better code comments.  I would suggest writing the user-facing
documentation soon.  It is pretty clear that you've got the basics of
this working, but it's impossible to understand what the semantics are
supposed to be except by staring at the code until you figure it out,
or running experiments.  People who are interested in this
functionality are more likely to provide useful feedback if there's
something they can read and go "yeah, that looks right" or "wait, that
sounds wrong."  Also, there may be places where, in the process of
documenting, you realize that things should

Re: Ordered Partitioned Table Scans

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 07:17, Robert Haas  wrote:
>
> On Wed, Dec 19, 2018 at 5:08 PM David Rowley
>  wrote:
> > With my idea for using live_parts, we'll process the partitions
> > looking for interleaved values on each query, after pruning takes
> > place. In this case, we'll see the partitions are naturally ordered. I
> > don't really foresee any issues with that additional processing since
> > it will only be a big effort when there are a large number of
> > partitions, and in those cases the planner already has lots of work to
> > do. Such processing is just a drop in the ocean when compared to path
> > generation for all those partitions.
>
> I agree that partitions_are_ordered() is cheap enough in this patch
> that it probably doesn't matter whether we cache the result.  On the
> other hand, that's mostly because you haven't handled the hard cases -
> e.g. interleaved list partitions.  If you did, then it would be
> expensive, and it probably *would* be worth caching the result.  Now
> maybe those hard cases aren't worth handling anyway.

I admit that I didn't understand the idea of the flag at the time; I
failed to see the point of it because I thought the flag would be
useless whenever partitions are pruned at plan time.  However, as
Julien explained, the flag would work like this: "Yes" means "Yes,
it's okay to do ordered scans", and "No" means "If some partitions
were pruned, recheck using only the non-pruned ones".  That seems
fine and very sane to me now that I understand it.  FWIW, my moment
of realisation came in [1].
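
In code form, the semantics described above would be roughly as follows
(the field and helper names are placeholders, not what the patch
actually has):

/* sketch of the flag semantics only; names are placeholders */
if (rel->partitions_ordered)
{
	/* cached "yes" remains valid even after partitions have been pruned */
	ordered = true;
}
else
{
	/* "no" means: recheck, but considering only the surviving partitions */
	ordered = partitions_are_ordered_after_pruning(rel, rel->live_parts);
}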

However, my thought is that adding new flags and the live_parts field
to RelOptInfo raises the bar a bit for this patch.  There are already
quite a number of partition-related fields in RelOptInfo.
Understanding what each of those does is not trivial, so I figured
that this patch would be much easier to consider if I skipped that
part for the first cut version.  I also feared a lot of instability
in which fields exist, due to Amit's planner improvement patches, and
I didn't want to take a dependency on WIP.  I had to deal with that
last year for run-time pruning and it turned out not to be fun.

> You also seem to be saying that since run-time partition pruning
> might change the answer, caching the initial answer is pointless.  But
> I think Julien has made a good argument for why that's wrong: if the
> initial answer is that the partitions are ordered, which will often be
> true, then we can skip all later checks.
>
> So I am OK with the fact that this patch doesn't choose to cache it,
> but I don't really buy any of your arguments for why it would be a bad
> idea.

OK, good.  I agree.  For the record, I want to steer clear of the flag
in this first cut version, especially now given what time it is.

[1] 
https://www.postgresql.org/message-id/cakjs1f_r51oapsn1oc4i36d7vznnihnk+1widfg0qrvb_eo...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: dropdb --force

2019-03-05 Thread Filip Rembiałkowski
Here is Pavel's patch rebased onto the master branch, with the dropdb
--force option, a test case, and documentation added.

I'm willing to keep working on it if needed.  What bad things could
possibly happen here?  Is the documentation clear enough?

Thanks.


On Tue, Dec 18, 2018 at 4:34 PM Marti Raudsepp  wrote:
>
> Hi
>
> > út 18. 12. 2018 v 16:11 odesílatel Filip Rembiałkowski 
> >  napsal:
> >> Please share opinions if this makes sense at all, and has any chance
> >> going upstream.
>
> Clearly since Pavel has another implementation of the same concept,
> there is some interest in this feature. :)
>
> On Tue, Dec 18, 2018 at 5:20 PM Pavel Stehule  wrote:
> > One of my customers still uses a patch that implements FORCE at the SQL
> > level.  It is necessary under higher load, when it is not easy to
> > synchronize clients.
>
> I think Filip's approach of setting pg_database.datallowconn='false'
> is pretty clever to avoid the synchronization problem. But it's also a
> good idea to expose this functionality via DROP DATABASE in SQL, like
> Pavel's patch, not just the 'dropdb' binary.
>
> If this is to be accepted into PostgreSQL core, I think the two
> approaches should be combined on the server side.
>
> Regards,
> Marti Raudsepp
diff --git a/doc/src/sgml/ref/drop_database.sgml b/doc/src/sgml/ref/drop_database.sgml
index 3ac06c984a..84df485e11 100644
--- a/doc/src/sgml/ref/drop_database.sgml
+++ b/doc/src/sgml/ref/drop_database.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-DROP DATABASE [ IF EXISTS ] name
+DROP DATABASE [ IF EXISTS ] name [ FORCE ]
 
  
 
@@ -32,9 +32,11 @@ DROP DATABASE [ IF EXISTS ] name
DROP DATABASE drops a database. It removes the
catalog entries for the database and deletes the directory
containing the data.  It can only be executed by the database owner.
-   Also, it cannot be executed while you or anyone else are connected
-   to the target database.  (Connect to postgres or any
-   other database to issue this command.)
+   Also, it cannot be executed while you are connected to the target database.
+   (Connect to postgres or any other database to issue this
+   command.)
+   If anyone else is connected to the target database, this command will fail
+   unless you use the FORCE option described below.
   
 
   
@@ -64,6 +66,24 @@ DROP DATABASE [ IF EXISTS ] name
  
 

+
+   
+FORCE
+
+ 
+  Attempt to terminate all existing connections to the target database.
+ 
+ 
+  This will fail if the current user has no permission to terminate other
+  connections.  The required permissions are the same as for
+  pg_terminate_backend, described
+  in .
+
+  This will also fail if the connections do not terminate within 60 seconds.
+ 
+
+   
+
   
  
 
diff --git a/doc/src/sgml/ref/dropdb.sgml b/doc/src/sgml/ref/dropdb.sgml
index 38f38f01ce..365ba317c1 100644
--- a/doc/src/sgml/ref/dropdb.sgml
+++ b/doc/src/sgml/ref/dropdb.sgml
@@ -86,6 +86,20 @@ PostgreSQL documentation
   
  
 
+ 
+  -f
+  --force
+  
+   
+Force termination of connected backends before removing the database.
+   
+   
+This will add the FORCE option to the DROP
+DATABASE command sent to the server.
+   
+  
+ 
+
  
-V
--version
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d207cd899f..2137bee3a7 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -487,7 +487,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 * potential waiting; we may as well throw an error first if we're gonna
 	 * throw one.
 	 */
-	if (CountOtherDBBackends(src_dboid, &notherbackends, &npreparedxacts))
+	if (CountOtherDBBackends(src_dboid, &notherbackends, &npreparedxacts, false))
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
 				 errmsg("source database \"%s\" is being accessed by other users",
@@ -777,7 +777,7 @@ createdb_failure_callback(int code, Datum arg)
  * DROP DATABASE
  */
 void
-dropdb(const char *dbname, bool missing_ok)
+dropdb(const char *dbname, bool missing_ok, bool force)
 {
 	Oid			db_id;
 	bool		db_istemplate;
@@ -785,6 +785,7 @@ dropdb(const char *dbname, bool missing_ok)
 	HeapTuple	tup;
 	int			notherbackends;
 	int			npreparedxacts;
+	int			loops = 0;
 	int			nslots,
 nslots_active;
 	int			nsubscriptions;
@@ -868,12 +869,33 @@ dropdb(const char *dbname, bool missing_ok)
 	 *
 	 * As in CREATE DATABASE, check this after other error conditions.
 	 */
-	if (CountOtherDBBackends(db_id, &notherbackends, &npreparedxacts))
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_IN_USE),
-				 errmsg("database \"%s\" is being accessed by other users",
-						dbname),
-				 errdetail_busy_db(notherbackends, npreparedxacts)));
+	for (;;)
+	{
+		/*
+		 * CountOtherDBBackends checks the usage of the database by other
+		 * backends and tries to wait 5 sec.  We raise a warning after 1 minute
+		 * and raise an error 
