Re: [HACKERS] Surjective functional indexes

2018-11-08 Thread Konstantin Knizhnik



On 07.11.2018 22:25, Tom Lane wrote:

Robert Haas  writes:

On Fri, May 11, 2018 at 11:01 AM, Simon Riggs  wrote:

I have no problem if you want to replace this with an even better
design in a later release.

Meh. The author / committer should get a patch into the right shape

They have done, at length. Claiming otherwise is just trash talk.

Some people might call it "honest disagreement".

So where we are today is that this patch was lobotomized by commits
77366d90f/d06fe6ce2 as a result of this bug report:

https://postgr.es/m/20181106185255.776mstcyehnc63ty@alvherre.pgsql

We need to move forward, either by undertaking a more extensive
clean-out, or by finding a path to a version of the code that is
satisfactory.  I wanted to enumerate my concerns while yesterday's
events are still fresh in mind.  (Andres or Robert might have more.)

* I do not understand why this feature is on-by-default in the first
place.  It can only be a win for expression indexes that are many-to-one
mappings; for indexes that are one-to-one or few-to-one, it's a pretty
big loss.  I see no reason to assume that most expression indexes fall
into the first category.  I suggest that the design ought to be to use
this optimization only for indexes for which the user has explicitly
enabled recheck_on_update.  That would allow getting rid of the cost check
in IsProjectionFunctionalIndex, about which we'd otherwise have to have
an additional fight: I do not like its ad-hoc-ness, nor the modularity
violation (and potential circularity) involved in having the relcache call
cost_qual_eval.
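
(For example, under that design a user who wants the optimization for a
particular expression index would request it explicitly, along these lines;
the option name follows the patch under discussion, and the table and
expression are purely illustrative:

  CREATE INDEX orders_warehouse_idx ON orders ((details->>'warehouse'))
      WITH (recheck_on_update = on);

Everything else would keep the current, cheaper behaviour.)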

* Having heapam.c calling the executor also seems like a
modularity/layering violation that will bite us in the future.

* The implementation seems incredibly inefficient.  ProjIndexIsUnchanged
is doing creation/destruction of an EState, index_open/index_close
(including acquisition of a lock we should already have), BuildIndexInfo,
and expression compilation, all of which are completely redundant with
work done elsewhere in the executor.  And it's doing that *over again for
every tuple*, which totally aside from wasted cycles probably means there
are large query-lifespan memory leaks in an UPDATE affecting many rows.
I think a minimum expectation should be that one-time work is done only
one time; but ideally none of those things would happen at all because
we could share work with the regular code path.  Perhaps it's too much
to hope that we could also avoid duplicate computation of the new index
expression values, but as long as I'm listing complaints ...

* As noted in the bug thread, the implementation of the new reloption
is broken because (a) it fails to work for some built-in index AMs
and (b) by design, it can't work for add-on AMs.  I agree with Andres
that that's not acceptable.

* This seems like bad data structure design:

+   Bitmapset  *rd_projidx; /* Oids of projection indexes */

That comment is a lie, although IMO it'd be better if it were true;
a list of target index OIDs would be a far better design here.  The use
of rd_projidx as a set of indexes into the relation's indexlist is
inefficient and overcomplicated, plus it creates an unnecessary and not
very safe (even if it were documented) coupling between rd_indexlist and
rd_projidx.

* Having ProjIndexIsUnchanged examine relation->rd_projidx directly is
broken by design anyway, both from a modularity standpoint and because
its inner loop involves catalog accesses that could result in relcache
flushes.  I'm somewhat amazed that the regression tests passed on
CLOBBER_CACHE_ALWAYS buildfarm animals, although possibly this is
explained by the fact that they're only testing cases with a single
expression index, so that the bitmap isn't checked again after the cache
flush happens.  Again, this could be fixed if what was returned was just
a list of relevant index OIDs.

* This bit of coding is unsafe, for the reason explained in the existing
comment:

  /*
   * Now save copies of the bitmaps in the relcache entry.  We intentionally
   * set rd_indexattr last, because that's the one that signals validity of
   * the values; if we run out of memory before making that copy, we won't
   * leave the relcache entry looking like the other ones are valid but
   * empty.
   */
  oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
  relation->rd_keyattr = bms_copy(uindexattrs);
  relation->rd_pkattr = bms_copy(pkindexattrs);
  relation->rd_idattr = bms_copy(idindexattrs);
  relation->rd_indexattr = bms_copy(indexattrs);
+relation->rd_projindexattr = bms_copy(projindexattrs);
+relation->rd_projidx = bms_copy(projindexes);
  MemoryContextSwitchTo(oldcxt);
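
A sketch of one way to keep that promise (not taken from the patch) is simply
to copy the new bitmaps before the one that signals validity:

  oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
  relation->rd_keyattr = bms_copy(uindexattrs);
  relation->rd_pkattr = bms_copy(pkindexattrs);
  relation->rd_idattr = bms_copy(idindexattrs);
  relation->rd_projindexattr = bms_copy(projindexattrs);
  relation->rd_projidx = bms_copy(projindexes);
  /* rd_indexattr stays last, since it is what signals validity */
  relation->rd_indexattr = bms_copy(indexattrs);
  MemoryContextSwitchTo(oldcxt);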

* There's a lot of other inattention to comments.  For example, I noticed
that this patch added new responsibilities to RelationGetIndexList without
updating its header comment to mention them.

* There's a lack of attention to terminology, too.  I do not think that
"projection index

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-08 Thread Haribabu Kommi
On Thu, Nov 8, 2018 at 4:53 PM Amit Kapila  wrote:

> On Sun, Nov 4, 2018 at 6:37 AM Haribabu Kommi 
> wrote:
> >
> > On Sun, Nov 4, 2018 at 11:17 AM Michael Paquier 
> wrote:
> >>
> >> On Sat, Nov 03, 2018 at 03:56:14PM +0530, Amit Kapila wrote:
> >> > Before trying out any solution or deciding which is better, I think we
> >> > want to understand why the variability in results occurred only after
> >> > your patch?  Without the patch, it works just fine.
> >>
> >> Good point.  We surely want to have a stable feature, which gets tested
> >> without triggering random failures in the builfarm.
> >
> >
> > Thanks for the review.
> >
> > This patch has changed the pg_stat_statements_reset() function from
> returning void
> > to number statements that it reset.
> >
>
> What is the motivation of that change?  It seems to be
> backward-incompatible change.  I am not telling we can't do it, but do
> we have strong justification?  One reason, I could see is to allow the
> user to get the exact value of statements that got reset and maybe
> that is more helpful as we have now additional parameters in the API,
> but I am not sure if that is a good justification.
>

Yes, as you said, that is the main reason to modify the function to return
the number of statements that it reset. Without any output from the
function, there is no way to know how many statements it reset. Earlier it
used to reset every statement, so there was no need for any output, but
with this approach I feel that returning the number of statements is
useful.


> > The regression test contains pg_stat_statements_reset() as the first query,
> > to reset any query stats that are already tracked and let the test produce
> > the proper results. But with this feature, if we run this regression test
> > on an already running server, the first query's result varies and it leads
> > to a test failure.
> >
> > So to fix this problem, I added a wrapper function that masks the result of
> > the pg_stat_statements_reset() function and just returns void; with this
> > wrapper function used as the first statement, the test is stable, as it
> > takes care of resetting statements already existing on the running server.
> >
>
> Or you could change the query to something like (Select NULL from
> pg_stat_statements_reset())
>

Thanks. I changed it accordingly.
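
That is, the first statement of the regression test becomes something along
these lines (illustrative form only), which discards the returned count so
the output no longer depends on what an already-running server had tracked:

  SELECT NULL AS dummy FROM pg_stat_statements_reset();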


> > With the above change, the regression test is stable. Comments?
> >
>
> In the comment message, you wrote: "Backward Compatible change, Now
> pg_stat_statements_reset() function returns the number of statement
> entries that are reset."
>
> Do you want to say incompatible instead of compatible?
>

Yes, you are correct.


> +  If no parameter is specified or specify everything as
> NULL(invalid),
> +  it will discard all statistics.
> +  By default, this function can only be executed by superusers.
> Access may
> +  be granted to others using GRANT.
>
> I think it will look better if you can continue the "By default, ..."
> line from where the previous line ends rather than starting a new
> line.
>

Updated as per your suggestion.
Attached is an updated patch with the above corrections.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-pg_stat_statements_reset-to-reset-specific-query-use_v8.patch
Description: Binary data


Re: ON COMMIT actions and inheritance

2018-11-08 Thread Michael Paquier
On Thu, Nov 08, 2018 at 04:46:46PM +0900, Amit Langote wrote:
> How about:
> When used on tables with inheritance children (including partitioned
> tables), this also drops the children (partitions).

Even if the style gets heavier, I have also the following in my box:
When used on a partitioned table, this action drops its partitions and
when used on tables with inheritance children, it drops the depending
children.
--
Michael




Re: ON COMMIT actions and inheritance

2018-11-08 Thread Amit Langote
On 2018/11/08 18:03, Michael Paquier wrote:
> On Thu, Nov 08, 2018 at 04:46:46PM +0900, Amit Langote wrote:
>> How about:
>> When used on tables with inheritance children (including partitioned
>> tables), this also drops the children (partitions).
> 
> Even if the style gets heavier, I have also the following in my box:
> When used on a partitioned table, this action drops its partitions and
> when used on tables with inheritance children, it drops the depending
> children.

I think we don't need to say "depending" children here, even though I know
the dependency management system is involved internally.

In the end, I will let you go with whatever you think is clearer. :)

Thanks,
Amit




Re: New vacuum option to do only freezing

2018-11-08 Thread Masahiko Sawada
On Tue, Nov 6, 2018 at 2:48 AM Robert Haas  wrote:
>
> On Mon, Nov 5, 2018 at 3:12 AM Masahiko Sawada  wrote:
> > Adding a field-and-value style option might be worth. Or maybe we can
> > add one option for example freeze_without_index_cleanup?
>
> That seems non-orthogonal.  We have an existing flag to force freezing
> (FREEZE); we don't need a second option that does the same thing.
> Skipping the index scans (and thus necessarily the second heap pass)
> is a separate behavior which we don't currently have a way to control.
>

We already have a disable_page_skipping option rather than (page_skipping
false), so ISTM disable_index_cleanup would be more natural. Also, since
this option not only skips vacuuming indexes but also skips removing dead
tuples from the heap, I think it should have a name that makes it clear to
users that it removes fewer dead tuples than a normal vacuum does and that
it is aimed at freezing tuples faster. Of course we will document that,
though.
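
Just to make that concrete, the invocation would then look something like the
following; the option name is illustrative and still under discussion here:

  VACUUM (FREEZE, DISABLE_INDEX_CLEANUP) mytable;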

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-08 Thread Amit Kapila
On Thu, Nov 8, 2018 at 1:52 PM Haribabu Kommi  wrote:
> On Thu, Nov 8, 2018 at 4:53 PM Amit Kapila  wrote:
>> > This patch has changed the pg_stat_statements_reset() function from 
>> > returning void
>> > to number statements that it reset.
>> >
>>
>> What is the motivation of that change?  It seems to be
>> backward-incompatible change.  I am not telling we can't do it, but do
>> we have strong justification?  One reason, I could see is to allow the
>> user to get the exact value of statements that got reset and maybe
>> that is more helpful as we have now additional parameters in the API,
>> but I am not sure if that is a good justification.
>
>
> Yes, as you said that is the main reason to modify the function to return
> the number of statements that it reset. Without having any output from
> the function, there is no way that how many statements that the above
> function reset. Earlier it used to reset every statement, so I feel there is
> no need of any output, but I feel giving the number of statements with
> this approach is good.
>

Sure, but what are we going to achieve with that number?  What
information is the user going to get from it?  If it can help us ensure
that it has reset the expected number of statements, then I can see
a clear use, but without that, the return value doesn't seem to
have any clear purpose.  So, I don't see much value in breaking
compatibility.

Does anyone else have an opinion on this matter?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PostgreSQL Limits and lack of documentation about them.

2018-11-08 Thread Peter Eisentraut
On 08/11/2018 04:13, David Rowley wrote:
> I added something along those lines in a note below the table. Likely
> there are better ways to format all this, but trying to detail out
> what the content should be first.
> 
> Hopefully I've addressed the other things mentioned too.

Could you adjust this to use fewer capital letters, unless they start
sentences or similar?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: fix psql \conninfo & \connect when using hostaddr

2018-11-08 Thread Arthur Zakirov

On 07.11.2018 20:11, Pavel Stehule wrote:
On Wed, Nov 7, 2018 at 15:11, Arthur Zakirov wrote:

> I think there is lack of necessary braces here for first if and second
> else branches. This is true for both patches.
?


I just meant something like this (additional "{", "}" braces):

if (is_absolute_path(host))
{
...
}
else
{
...
}

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



A typo in partprune.c

2018-11-08 Thread Yuzuko Hosoya
Hi,

I found a typo in the comment of get_matching_range_bounds.
/*
- * get_matching_range_datums
+ * get_matching_range_bounds
  * Determine the offsets of range bounds matching the specified values,
  * according to the semantics of the given operator strategy

Here is the patch to fix it.

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center



fix_partprune_typo.patch
Description: Binary data


Re: fix psql \conninfo & \connect when using hostaddr

2018-11-08 Thread Fabien COELHO



Hello Robert,


 psql> \conninfo
 You are connected to database "fabien" as user "fabien" on host "foo" at port "5432".


I remain of the opinion that this is not a bug.  You told it that foo 
has address 127.0.0.1 and it believed you; that's YOUR fault.


Hmmm. For me, if a user asks \conninfo for connection information, they 
expect to be told what the connection actually is, regardless of the 
initial connection string.


Another, more striking instance:

  sh> psql "host=/tmp port=5432 hostaddr=127.0.0.1"
  ...
  fabien=# \conninfo
  You are connected to database "fabien" as user "fabien" via socket in "/tmp" at port "5432".
  SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)

It says that there is a socket, but there is none. The SSL bit is a 
giveaway, there is no SSL on Unix-domain sockets.



 sh> psql "host=foo hostaddr=127.0.0.1"

 psql> \conninfo
 You are connected to database "fabien" as user "fabien" on host "foo" (address "127.0.0.1") at port "5432".


Nevertheless, that seems like a reasonable change to the output.  Will 
your patch show the IP address in all cases or only when hostaddr is 
specified?


It is always printed, unless both host & address are equal.

The rationale is that it is also potentially useful for multi-IP DNS
resolutions, and generating a valid hostaddr allows \connect defaults to
reuse exactly the same connection, including the IP that was chosen.


Also, the added information is quite short, and if a user explicitly asks
for connection information, I think they can handle the slightly expanded 
answer.


--
Fabien.



Re: Resetting PGPROC atomics in ProcessInit()

2018-11-08 Thread Amit Kapila
On Sun, Nov 4, 2018 at 6:30 AM Robert Haas  wrote:
>
> On Sat, Oct 27, 2018 at 6:41 AM Andres Freund  wrote:
> > I just noticed, while working on a patch adding things to PGPROC, that
> > the group clearing patches for the proc array and clog reset atomics in
> > InitProcess().
> >
> > I'm not a big fan of that, because it means that it's not safe to look
> > at the atomics of backends that aren't currently in use.  Is there any
> > reason to not instead initialize them in InitProcGlobal() and just
> > assert in InitProcess() that they're 0?  If they're not, we'd be in deep
> > trouble anyway, no?
>
> I think you are correct.  I think it would be better in general for
> InitProcess() to Assert() rather than reinitializing.
>

Okay, changed the code as per Andres's and your suggestion.  Do you
think the attached change makes sense? I think we should backpatch
this.
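
In outline the change is as below; this is only a sketch with field names from
memory, the attached patch is authoritative:

  /* InitProcGlobal(): initialize the atomics once, at shared-memory setup */
  pg_atomic_init_u32(&procs[i].procArrayGroupNext, INVALID_PGPROCNO);
  pg_atomic_init_u32(&procs[i].clogGroupNext, INVALID_PGPROCNO);

  /* InitProcess(): merely assert that the slot is still in its reset state */
  Assert(pg_atomic_read_u32(&MyProc->procArrayGroupNext) == INVALID_PGPROCNO);
  Assert(pg_atomic_read_u32(&MyProc->clogGroupNext) == INVALID_PGPROCNO);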

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_atomic_init_group_clear_v1.patch
Description: Binary data


Re: A typo in partprune.c

2018-11-08 Thread Michael Paquier
On Thu, Nov 08, 2018 at 07:19:14PM +0900, Yuzuko Hosoya wrote:
> Here is the patch to fix it.

Thanks, committed.
--
Michael




Re: jsonpath

2018-11-08 Thread Tomas Vondra




On 11/8/18 6:12 AM, Tom Lane wrote:
> Tomas Vondra  writes:
>> BTW is the v19 really just a rebase of the preceding version?
>> I'm asking because v18 was adding two types into pg_type.dat, namely
>> jsonpath (6050) and _jsonpath (6051), while v19 only adds jsonpath
>> (6050).
> 
> I haven't looked at this patch, but manual generation of array type entries
> is obsolete since 3dc820c43e427371b66d217f2bd5481fc9ef2e2d.  There should
> still be a mention of the OID to use for the array type though ...
> 

Ah, I missed this improvement somehow. (Yes, the array type OID is still
mentioned as part of the type definition.)

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-08 Thread David Rowley
On 8 November 2018 at 20:15, Amit Langote  wrote:
> Actually, as I also proposed upthread, we should move root_tuple_slot from
> PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
> it's part of the first step described above that has nothing to do with
> partition tuple routing proper.  We can make PartitionTupleRouting private
> to execPartition.c if we do that.

Okay, makes sense. I've changed things around so that PartitionTupleRouting
is now private to execPartition.c.

> I didn't find anything quite significant to complain about, except just
> one line:
>
> + * Initially we must only setup 1 PartitionDispatch object; the one for
>
> setup -> set up

Changed too.

I've attached v15 and a delta from v14 to ease re-review.

I also ran pgindent on this version. That's not part of the delta but
is in the main patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v14_v15.diff
Description: Binary data


v15-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
Description: Binary data


Re: Special role for subscriptions

2018-11-08 Thread Evgeniy Efimkin
Hi!
I think we can add a FOR TABLES clause to CREATE/REFRESH SUBSCRIPTION, for
example: CREATE SUBSCRIPTION my_sub CONNECTION ... PUBLICATION my_pub [WITH
...] [ FOR TABLES t1, t2 | ALL TABLES ], as sketched below. ALL TABLES would
be available only to superusers; FOR TABLES t1, t2 would be available to the
owner of the tables and to superusers.
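
Spelled out, the proposed syntax would look roughly like this (a sketch only,
none of it implemented; the connection string is shortened):

  CREATE SUBSCRIPTION my_sub
      CONNECTION 'host=... dbname=...'
      PUBLICATION my_pub
      [ WITH (...) ]
      [ FOR TABLES t1, t2 | ALL TABLES ];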

07.11.2018, 00:52, "Stephen Frost" :
> Greetings,
>
> * Evgeniy Efimkin (efim...@yandex-team.ru) wrote:
>>  As a first step I suggest we allow CREATE SUBSCRIPTION for table owner only.
>
> That's a nice idea but seems like we would want to have a way to filter
> what tables a subscription follows then..? Just failing if the
> publication publishes tables that we don't have access to or are not the
> owner of seems like a poor solution..
>
> Thanks!
>
> Stephen

 
Ефимкин Евгений





RE: A typo in partprune.c

2018-11-08 Thread Yuzuko Hosoya
Hi Michael,

> From: Michael Paquier [mailto:mich...@paquier.xyz]
> Sent: Thursday, November 08, 2018 8:17 PM
> 
> On Thu, Nov 08, 2018 at 07:19:14PM +0900, Yuzuko Hosoya wrote:
> > Here is the patch to fix it.
> 
> Thanks, committed.

Thank you.

Yuzuko Hosoya
NTT Open Source Software Center






Re: shared-memory based stats collector

2018-11-08 Thread Kyotaro HORIGUCHI
Thank you for the comments.

This message contains the whole refactored patch set.

At Mon, 29 Oct 2018 15:10:10 +0100, Antonin Houska  wrote in 
<28855.1540822210@localhost>
> Kyotaro HORIGUCHI  wrote:
> 
> > This is more saner version of previous v5-0008, which didn't pass
> > regression test. v6-0008 to v6-0010 are attached and they are
> > applied on top of v5-0001-0007.
> > 
> > - stats collector has been removed.
> > 
> > - modified dshash further so that deletion is allowed during
> >   sequential scan.
> > 
> > - I'm not sure about the following existing comment at the
> >   beginning of pgstat.c
> > 
> >   * - Add a pgstat config column to pg_database, so this
> >   *   entire thing can be enabled/disabled on a per db basis.
> 
> Following is the next handful of my comments:
> 
> * If you remove the stats collector, I think the remaining code in pgstat.c
>   does no longer fit into the backend/postmaster/ directory.

I didn't consider that, but I still don't find a nice place among the
existing directories. backend/statistics might be it, but I feel somewhat
uneasy about that. Finally I split pgstat.c into two files (as suggested
in the file comment) and put both of them into a new directory,
backend/statmon. One part is the backend status facility, named bestatus
(BackEnd Status), and the other is pgstat, the access statistics part.
The last patch (0008) does that. I tried to move it earlier but it was a
bit tough.

> * I'm not sure it's o.k. to call pgstat_write_statsfiles() from
>   postmaster.c:reaper(): the function can raise ERROR (I see at least one code
>   path: pgstat_write_statsfile() -> get_dbstat_filename()) and, as reaper() is
>   a signal handler, it's hard to imagine the consequences. Maybe a reason to
>   leave some functionality in a separate worker, although the "stats
>   collector" would have to be changed.

I was careless about that. longjmp() is not allowed from a signal handler,
so we mustn't emit ERROR there. In the first place those calls were in the
wrong places from several perspectives, so I changed the way loading and
storing are done. In the attached patch (0003), loading is performed while
initializing shared memory on the postmaster, and storing is done in a
shutdown hook on the postmaster. Since the shared memory area is inherited
by all children, no process actually does the initial attaching any longer.
In addition, the archiver process became an auxiliary process (0004), since
it writes to the shared stats.

> * Question still remains whether all the statistics should be loaded into
>   shared memory, see the note on paging near the bottom of [1].

Even counting possible page-in latency when writing stats, I agree with
what Robert said in that message: on average we win, for users who don't
create very many databases.  If some stats page to be written had been
paged out, the related heap pages would likely have been evicted from
shared buffers as well (or the buffer pages themselves paged out), and
every resource that can be swapped out may already be swapped out, so I
don't think it becomes a serious problem. As for reading stats, we
currently read a file and sometimes wait for an up-to-date file to be
produced; I think that case needs no further comment.

For a cluster with a great many databases, a backend running in one
database will mainly see only the stats for the current database (and for
shared tables), so we can split the stats by that criterion in a next step.

> * if dshash_seq_init() is passed consistent=false, shouldn't we call
>   ensure_valid_bucket_pointers() also from dshash_seq_next()? If the scan
>   needs to access the next partition and the old partition lock got released,
>   the table can be resized before the next partition lock is acquired, and
>   thus the backend-local copy of buckets becomes obsolete.

Oops, you're right.  In addition, resizing can happen while
dshash_seq_next() moves the lock to the next partition; a resize happening
on the way breaks the sequential scan semantics. I added
ensure_valid_bucket_pointers() after the initial acquisition of the
partition lock, and the lock is now moved seamlessly during the scan. (0001)

> * Neither snapshot_statentry_all() nor backend_snapshot_all_db_entries() seems
>   to be used in the current patch version.

Thanks. This is not used since we concluded that we no longer
need strict consistency in stats numbers. Removed. (0003)

> * pgstat_initstats(): I think WaitLatch() should be used instead of sleep().

The bgwriter and checkpointer waited for the postmaster's loading of the
stats files. But I changed the startup sequence (as mentioned above), so
the wait became useless. Most of those waits are replaced with Assert.
(0003)

> * pgstat_get_db_entry(): "return false" should probably be "return NULL".

I can't find that. (Wouldn't it be caught by the compiler?) Maybe you mean
"found = false"?  (it might be a bit tricky)

> * Is the PGSTAT_TABLE_WRITE flag actually used? Unlike PGSTAT_TABLE_CREATE, I
>   couldn't find a place where it's value is tested.

Thank you for finding that. As you pointed out, PGSTAT_TABLE_WRITE is
finally not used since WRI

Re: PostgreSQL Limits and lack of documentation about them.

2018-11-08 Thread David Rowley
On 8 November 2018 at 22:46, Peter Eisentraut
 wrote:
> Could you adjust this to use fewer capital letters, unless they start
> sentences or similar?

Yeah. Changed in the attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v4-0001-Add-documentation-section-appendix-detailing-some.patch
Description: Binary data


Re: Connection limit doesn't work for superuser

2018-11-08 Thread Laurenz Albe
Tomas Vondra wrote:
> On 11/7/18 5:19 PM, Tom Lane wrote:
> > I think this proposal boils down to asking for support for an
> > incredibly bad application design, and equipping every database with
> > an additional foot-gun in order to have that.
> 
> I'm not sure about that. IMHO being able to restrict the number of
> superuser connections can be used to force users to use regular
> (non-superuser) roles for stuff that does not require that. Which should
> encourage better application design.
> 
> Of course, the question is whether such limit can actually be enforced
> reliably (I mean, can't the superuser simply change it?) and whether
> handing over superuser accounts to application users is a good idea in
> general ...

None of these arguments for enforcing a connection limit for superusers
sound terribly compelling to me.

He who is tempted to run his application with a superuser for
simplicity's sake will not be the person to set a connection limit
for superusers.

I concur with Tom that this will do more harm than good.

Yours,
Laurenz Albe




Re: valgrind issues on Fedora 28

2018-11-08 Thread Tomas Vondra
On 11/6/18 6:51 PM, Tomas Vondra wrote:
> On 11/6/18 6:35 PM, Alvaro Herrera wrote:
>> On 2018-Nov-06, Tomas Vondra wrote:
>>
>>> Hi,
>>>
>>> I've recently updated to Fedora 28, and in that environment I get
>>> quite a
>>> few new valgrind issues (see the attached log).
>>
>> Hmm, related to
>> https://postgr.es/m/20180220150838.GD18315@e733.localdomain ?
>> Apparently the right answer is to add local suppressions ...
>>
> 
> Yep, seems like that. I'll check if the patch proposed by Aleksander
> resolves all the issues reported by valgrind.
> 

OK, I propose we add suppressions per the attached patch.

The rules are clearly incomplete, due to dependence on optimization
used. For example now the stack includes __wcsnlen_avx2, but I see it
was __wcsnlen_sse4_1 in Aleksander's report in February.

Not sure what to do about this - maybe we should wildcard this first
frame, to accept all wcsnlen variants. Opinions?
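
Concretely, that would mean writing the first frame of each rule with a
wildcard (valgrind accepts shell-style wildcards in "fun:" frames), e.g.:

{
   wcsnlen_optimized
   Memcheck:Cond
   fun:__wcsnlen_*
   fun:wcsrtombs
   fun:wcstombs
   fun:wchar2char
}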

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index 2f3b602773..1c8311338c 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -213,3 +213,39 @@
Memcheck:Cond
fun:PyObject_Realloc
 }
+
+# wcsrtombs uses some clever optimizations internally, which to valgrind
+# may look like access to uninitialized data. For example AVX2 instructions
+# load data in 256-bit chunks, irrespective of the wchar length. gconv does
+# something similar by loading data in 32-bit chunks and then shifting the
+# data internally. Neither of those actually uses the uninitialized part
+# of the buffer, as far as we know.
+#
+# https://www.postgresql.org/message-id/90ac0452-e907-e7a4-b3c8-15bd33780...@2ndquadrant.com
+
+{
+   wcsnlen_optimized
+   Memcheck:Cond
+   fun:__wcsnlen_avx2
+   fun:wcsrtombs
+   fun:wcstombs
+   fun:wchar2char
+}
+
+{
+   wcsnlen_optimized_addr32
+   Memcheck:Addr32
+   fun:__wcsnlen_avx2
+   fun:wcsrtombs
+   fun:wcstombs
+   fun:wchar2char
+}
+
+{
+   gconv_transform_internal
+   Memcheck:Cond
+   fun:__gconv_transform_internal_utf8
+   fun:wcsrtombs
+   fun:wcstombs
+   fun:wchar2char
+}


Re: [HACKERS] Surjective functional indexes

2018-11-08 Thread Laurenz Albe
Tom Lane wrote:
> I wanted to enumerate my concerns while yesterday's
> events are still fresh in mind.  (Andres or Robert might have more.)
> 
> * I do not understand why this feature is on-by-default in the first
> place.  It can only be a win for expression indexes that are many-to-one
> mappings; for indexes that are one-to-one or few-to-one, it's a pretty
> big loss.  I see no reason to assume that most expression indexes fall
> into the first category.  I suggest that the design ought to be to use
> this optimization only for indexes for which the user has explicitly
> enabled recheck_on_update.  That would allow getting rid of the cost check
> in IsProjectionFunctionalIndex, about which we'd otherwise have to have
> an additional fight: I do not like its ad-hoc-ness, nor the modularity
> violation (and potential circularity) involved in having the relcache call
> cost_qual_eval.

That was my impression too when I had a closer look at this feature.

What about an option "hot_update_check" with values "off" (default),
"on" and "always"?

Yours,
Laurenz Albe




Re: Connection slots reserved for replication

2018-11-08 Thread Kyotaro HORIGUCHI
Hello.

At Wed, 7 Nov 2018 19:31:00 +0900, Masahiko Sawada  
wrote in 
> On Tue, Nov 6, 2018 at 9:16 PM Kyotaro HORIGUCHI
>  wrote:
> > InitializeMaxBackends()
> > MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
> > -   max_worker_processes;
> > +   max_worker_processes + replication_reserved_connections;
> >
> > This means walsender doesn't comsume a connection, which is
> > different from the current behavior. We should reserve a part of
> > MaxConnections for walsenders.  (in PostmasterMain,
> > max_wal_senders is counted as a part of MaxConnections)
> 
> Yes. We can force replication_reserved_connections <= max_wal_senders
> and then reserved connections for replication should be a part of
> MaxConnections.
> 
> >
> > +   if (am_walsender && replication_reserved_connections < 
> > max_wal_senders
> > +   && *procgloballist == NULL)
> > +   procgloballist = &ProcGlobal->freeProcs;
> >
> > Currently exccesive number of walsenders are rejected in
> > InitWalSenderSlot and emit the following error.
> >
> > >ereport(FATAL,
> > >(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
> > > errmsg("number of requested standby connections "
> > >"exceeds max_wal_senders (currently %d)",
> > >max_wal_senders)));
> >
> > With this patch, if max_wal_senders =
> > replication_reserved_connections = 3 and the fourth walreceiver
> > comes, we will get "FATAL: sorry, too many clients already"
> > instead. It should be fixed.
> >
> > When r_r_conn = 2 and max_wal_senders = 3 and the three
> > walsenders are active, in an exreme case where a new replication
> > connection comes at the same time another is exiting, we could
> > end up using two normal slots despite that one slot is vacant in
> > reserved slots.
> 
> Doesn't the max_wal_senders prevent the case?

Currently the variable doesn't work that way. We first accept the
connection request, search for a vacant slot in InitWalSenderSlot, and
reject the connection if no room is available. Even with this patch, we
don't count the exact number of active walsenders (for performance
reasons). If the reserved slots are filled, there's no way other than to
accept the connection using a non-reserved slot as long as r_r_conn <
max_wal_senders. If one of the active walsenders goes away between the
time we allocate a non-reserved connection slot and the time
InitWalSenderSlot starts searching the walsnds[] array, the new walsender
ends up activated on the unreserved slot and one reserved slot is left
empty. So this is "an extreme case". We could ignore the case.

I doubt that we should allow a setting where r_r_conn < max_wal_senders,
or even r_r_conn != max_wal_senders. We don't have a problem like this if
we don't allow those cases.

> Wal senders can get connection if we have free procs more than
> (MaxConnections - reserved for superusers). So I think for normal
> users the connection must be refused if (MaxConnections - (reserved
> for superuser and replication) > # of freeprocs) and for wal senders
> the connection also must be refused if (MaxConnections - (reserved for
> superuser) > # of freeprocs). I'm not sure we need such trick in
> InitWalSenderSlot().

(For clarity, I don't mean that my previous patch is a good solution.)

It works as long as we accept that some reserved slots can be left unused
even though some walsenders are using normal slots. (A walsender exiting
while it is using a reserved slot causes this, but that slot is usually
taken by a walsender that comes later.)

Another idea is to acquire a walsnd[] slot before getting a connection
slot.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: fix psql \conninfo & \connect when using hostaddr

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-08, Arthur Zakirov wrote:

> On 07.11.2018 20:11, Pavel Stehule wrote:
> > On Wed, Nov 7, 2018 at 15:11, Arthur Zakirov wrote:
> > > I think there is lack of necessary braces here for first if and second
> > > else branches. This is true for both patches.
> > ?
> 
> I just meant something like this (additional "{", "}" braces):

We omit braces when there's an individual statement.  (We do add the
braces when we have a comment atop the individual statement, though, to
avoid pgindent from doing a stupid thing.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

2018-11-08 Thread Etsuro Fujita

(2018/11/08 10:50), Thomas Munro wrote:

On Wed, Nov 7, 2018 at 11:03 PM Etsuro Fujita
  wrote:

(2018/11/07 9:22), Kyotaro HORIGUCHI wrote:

At Tue, 06 Nov 2018 21:07:37 +0900, Etsuro Fujita   
wrote in<5be18409.2070...@lab.ntt.co.jp>

(2018/11/06 12:53), Kyotaro HORIGUCHI wrote:

even if it comes from another pipe, we can assume that the
SIGPIPE immediately stopped the program before returning any
garbage to us.


Sorry, I don't understand this.


Mmm. It looks confusing, sorry. In other words:

We don't care the reason for program's ending if we received
enough data from the program and we actively (= before receiving
EOF) stop reading. On the other hand if SIGPIPE (on another pipe)
or something fatal happened on the program before we end reading,
we receive EOF just after that and we are interested in the cause
of the program's death.

# Does the above make sense?


Yeah, thanks for the explanation!


+1

I take back what I said earlier about false positives from other
pipes.  I think it's only traditional Unix programs designed for use
in pipelines and naive programs that let SIGPIPE terminate the
process.   The accepted answer here gives a good way to think about
it:

https://stackoverflow.com/questions/8369506/why-does-sigpipe-exist


Thanks for the information!


A program sophisticated enough to be writing to other pipes is no
longer in that category and should be setting up signal dispositions
itself, so I agree that we should enable the default disposition and
ignore WTERMSIG(exit_code) == SIGPIPE, as proposed.  That is pretty
close to the intended purpose of that signal AFAICS.


Great!


In the sense of "We don't care the reason", negligible reasons
are necessariry restricted to SIGPIPE, evan SIGSEGV could be
theoretically ignored safely. "theoretically" here means it is
another issue whether we rely on the output from a program which
causes SEGV (or any reason other than SIGPIPE, which we caused).


For the SIGSEGV case, I think it would be better not to rely on the
output data, IMO, because there is a possibility that the program has
generated that data incorrectly/unexpectedly.


+1

I don't think we should ignore termination by signals other than
SIGPIPE: that could hide serious problems from users.  I want to know
if my program is crashing with SIGBUS, SIGTERM, SIGFPE etc, even if it
happens after we read enough data; there is a major problem that a
human needs to investigate!


I think so too.


Finally in the attached PoC, I set cstate->eof_reached on failure
of NextCopyFromRawFields then if eof_reached we don't ingore
SIGPIPE. For a program like
"puts("test1");fflush(stdout);kill(getpid(), SIGPIPE);", I had
the following behavior. (I got an error without the fflush() but
it is inevitable).

=# select * from ft2 limit 1;
  a
---
test1

=# select * from ft2 limit 2;
ERROR:  program "/home/horiguti/work/exprog" failed
DETAIL:  child process was terminated by signal 13: Broken pipe

For the original case:

=# select * from ft1 limit 1;
 a
--
test
=# select * from ft1 limit 2;
 a
--
test
(1 row)


I didn't confirmed whether it is complete.


Sorry, I don't understand this fully, but the reason to add the
eof_reached stuff is to avoid ignoring the SIGPIPE failure in normal
COPY FROM PROGRAM cases?


Yes, if we have received EOF, the program ended before we
finished reading from the pipe thus any error should be
reported.  Elsewise we don't care at least SIGPIPE. In the
attached patch all kind of errors are ignored when pipe is closed
from reading-end on backends.


OK, I understand your idea.  Thanks for the patch!


+1


As the result it doesn't report an error for SELECT * FROM ft2
LIMIT 1 on "main(void){puts("test1"); return 1;}".

=#  select * from ft limit 1;
 a
---
   test1
(1 row)

limit 2 reports the error.

=# select * from ft limit 2;
ERROR:  program "/home/horiguti/work/exprog" failed
DETAIL:  child process exited with exit code 1


I think this would be contrary to users expectations: if the SELECT
command works for limit 1, they would expect that the command would work
for limit 2 as well.  So, I think it would be better to error out that
command for limit 1 as well, as-is.


I think it's correct that LIMIT 1 gives no error but LIMIT 2 gives an
error.  For LIMIT 1, we got all the rows we wanted, and then we closed
the pipe.  If we got a non-zero non-signal exit code, or a signal exit
code and it was SIGPIPE (not any other signal!), then we should
consider that to be expected.


Maybe I'm missing something, but a non-zero, non-signal exit code means
that there was something wrong with the called program, so I think a
human had better investigate that as well, IMO; that would probably be a
minor problem, though.  Too restrictive?
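
(For reference, the rule being debated amounts to something like the sketch
below.  This is illustrative only, not the attached patch; "stopped_early"
corresponds to the !eof_reached state discussed above.)

  #include <stdbool.h>
  #include <signal.h>
  #include <sys/wait.h>

  /* Classify a child's wait() status after we deliberately stopped reading. */
  static bool
  child_failure_is_expected(int status, bool stopped_early)
  {
      if (status == 0)
          return true;        /* clean exit is always fine */
      if (stopped_early && WIFSIGNALED(status) && WTERMSIG(status) == SIGPIPE)
          return true;        /* we broke the pipe ourselves: expected */
      return false;           /* anything else still deserves an ERROR */
  }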



I tried to think of a scenario where the already-received output is
truly invalidated by a later error that happens after we close the
pipe.  It could be something involving a progr

Re: pg_dump multi VALUES INSERT

2018-11-08 Thread Surafel Temesgen
On Tue, Nov 6, 2018 at 8:18 PM Alvaro Herrera 
wrote:

> On 2018-Nov-06, Surafel Temesgen wrote:
>
> > hi,
> >
> > On Sun, Nov 4, 2018 at 1:18 PM Fabien COELHO 
> wrote:
> >
> > > Patch does not seem to apply anymore, could you rebase?
> > >
> > The attached patch is a rebased version and work by ‘inserts=100’ as
> > Stephen suggest
>
> I thought the suggestion was that the number could be any positive
> integer, not hardcoded 100.

ok

> It shouldn't take much more code to handle
> it that way, which makes more sense to me
>

Yes, it's not many lines of code. Attached is a patch that optionally accepts
the number of rows per INSERT statement; if it is not specified, one row per
statement is used.
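
With the patch applied, usage would presumably look like:

  sh> pg_dump --inserts=100 mydb > mydb.sql

(with an optional argument the value has to be attached with "="; plain
--inserts keeps the old one-row-per-statement behaviour)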

regards

Surafel
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index b5fa4fb85c..70411cb6ac 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -780,7 +780,10 @@ PostgreSQL documentation
 non-PostgreSQL databases.
 However, since this option generates a separate command for each row,
 an error in reloading a row causes only that row to be lost rather
-than the entire table contents.
+than the entire table contents. The number of rows per INSERT statement
+can also be specified to make the dump file smaller and faster to reload,
+but then an error during reload causes the loss of all rows in the
+affected INSERT statement rather than just a single row.
 Note that
 the restore might fail altogether if you have rearranged column order.
 The --column-inserts option is safe against column
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index ba798213be..c5108ff6d8 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -72,6 +72,7 @@ typedef struct _restoreOptions
 	int			dropSchema;
 	int			disable_dollar_quoting;
 	int			dump_inserts;
+	int			dump_inserts_multiple;
 	int			column_inserts;
 	int			if_exists;
 	int			no_comments;	/* Skip comments */
@@ -145,6 +146,7 @@ typedef struct _dumpOptions
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
 	int			dump_inserts;
+	int			dump_inserts_multiple;
 	int			column_inserts;
 	int			if_exists;
 	int			no_comments;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index c8d01ed4a4..a5fa4917a8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -359,7 +359,7 @@ main(int argc, char **argv)
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
 		{"if-exists", no_argument, &dopt.if_exists, 1},
-		{"inserts", no_argument, &dopt.dump_inserts, 1},
+		{"inserts", optional_argument, NULL, 8},
 		{"lock-wait-timeout", required_argument, NULL, 2},
 		{"no-tablespaces", no_argument, &dopt.outputNoTablespaces, 1},
 		{"quote-all-identifiers", no_argument, "e_all_identifiers, 1},
@@ -562,6 +562,20 @@ main(int argc, char **argv)
 dosync = false;
 break;
 
+			case 8:			/* inserts values number */
+if (optarg)
+{
+	dopt.dump_inserts_multiple = atoi(optarg);
+	if (dopt.dump_inserts_multiple < 0)
+	{
+		write_msg(NULL, "insert values must be positive number\n");
+		exit_nicely(1);
+	}
+}
+else
+	dopt.dump_inserts = 1;
+break;
+
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit_nicely(1);
@@ -609,7 +623,7 @@ main(int argc, char **argv)
 		exit_nicely(1);
 	}
 
-	if (dopt.dump_inserts && dopt.oids)
+	if ((dopt.dump_inserts || dopt.dump_inserts_multiple) && dopt.oids)
 	{
 		write_msg(NULL, "options --inserts/--column-inserts and -o/--oids cannot be used together\n");
 		write_msg(NULL, "(The INSERT command cannot set OIDs.)\n");
@@ -619,7 +633,8 @@ main(int argc, char **argv)
 	if (dopt.if_exists && !dopt.outputClean)
 		exit_horribly(NULL, "option --if-exists requires option -c/--clean\n");
 
-	if (dopt.do_nothing && !(dopt.dump_inserts || dopt.column_inserts))
+	if (dopt.do_nothing && !(dopt.dump_inserts || dopt.column_inserts ||
+		dopt.dump_inserts_multiple))
 		exit_horribly(NULL, "option --on-conflict-do-nothing requires option --inserts or --column-inserts\n");
 
 	/* Identify archive format to emit */
@@ -889,6 +904,7 @@ main(int argc, char **argv)
 	ropt->use_setsessauth = dopt.use_setsessauth;
 	ropt->disable_dollar_quoting = dopt.disable_dollar_quoting;
 	ropt->dump_inserts = dopt.dump_inserts;
+	ropt->dump_inserts_multiple = dopt.dump_inserts_multiple;
 	ropt->no_comments = dopt.no_comments;
 	ropt->no_publications = dopt.no_publications;
 	ropt->no_security_labels = dopt.no_security_labels;
@@ -2073,6 +2089,193 @@ dumpTableData_insert(Archive *fout, void *dcontext)
 	return 1;
 }
 
+/*
+ * Dump table data using multiple values INSERT commands.
+ */
+static int
+dumpTableData_insert_multiple(Archive *fout, void *dcontext)
+{
+	TableDataInfo *tdinfo = (TableDataI

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-08, Amit Kapila wrote:

> Sure, but what are we going to achieve with that number?  What
> information user is going to get by that?  If it can help us to ensure
> that it has reset the expected number of statements, then I can see
> the clear usage, but without that, the return value doesn't seem to
> have any clear purpose.  So, I don't see much value in breaking
> compatibility.
> 
> Does anyone else have an opinion on this matter?

This was proposed by Sergei Kornilov in
https://postgr.es/m/3368121530260...@web21g.yandex.ru saying that "it
would be nice" to return it.  Maybe he has an use case in mind?  I don't
see one myself.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: New vacuum option to do only freezing

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-08, Masahiko Sawada wrote:

> On Tue, Nov 6, 2018 at 2:48 AM Robert Haas  wrote:
> >
> > On Mon, Nov 5, 2018 at 3:12 AM Masahiko Sawada  
> > wrote:
> > > Adding a field-and-value style option might be worth. Or maybe we can
> > > add one option for example freeze_without_index_cleanup?
> >
> > That seems non-orthogonal.  We have an existing flag to force freezing
> > (FREEZE); we don't need a second option that does the same thing.
> > Skipping the index scans (and thus necessarily the second heap pass)
> > is a separate behavior which we don't currently have a way to control.
> 
> We already have disable_page_skipping option, not (page_skipping
> false). So ISMT disable_index_cleanup would be more natural. Also,
> since what to do with this option is not only skipping vacuum indexes
> but also skipping removeing dead tuples on heap, I think that the
> option should have a more understandable name for users indicating
> that both it removes dead tuples less than the normal vacuum and it's
> aimed to freeze tuple more faster. Of course we document them, though.

I would name this based on the fact that it freezes quickly -- something
like FAST_FREEZE perhaps.  The other things seem implementation details.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



NULL handling in exconfig deconstruction

2018-11-08 Thread Daniel Gustafsson
In extension_config_remove() we first ensure that pg_extension.extconfig cannot
contain any NULL values in the array, ERRORing out if so.  Later, however, we
ask for NULL values back when deconstructing the array, throwing away the
results knowing there won't be any, which seems superfluous (and wrong if there indeed
were any).  The attached patch signals to deconstruct_array() that any NULL
values should be considered an error instead of returned, to keep NULL value
handling consistent (and shave two small allocations off).
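
In other words, the call shape becomes roughly the following (illustrative
only; the patch operates on the extconfig and extcondition arrays inside
extension_config_remove()):

  deconstruct_array(a, OIDOID, sizeof(Oid), true, 'i',
                    &elems, NULL /* error out on any NULL element */, &nelems);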

cheers ./daniel



extconfig_nulls.patch
Description: Binary data


Re: move PartitionBoundInfo creation code

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-08, Amit Langote wrote:

> On 2018/11/08 0:04, Alvaro Herrera wrote:
> >> On Wed, Nov 07, 2018 at 03:34:38PM +0900, Amit Langote wrote:
> >>> I think we can design the interface of partition_bounds_create such that
> >>> it returns information needed to re-arrange OIDs to be in the canonical
> >>> partition bound order, so the actual re-ordering of OIDs can be done by
> >>> the caller itself.  (It may not always be OIDs, could be something else
> >>> like a struct to represent fake partitions.)
> > 
> > This is interesting.  I don't think the current interface is so bad that
> > we can't make a few tweaks later; IOW let's focus on the current patch
> > for now, and we can improve later.  You (and David, and maybe someone
> > else) already have half a dozen patches touching this code, and I bet
> > some of these will have to be rebased over this one.
> 
> The patch on ATTACH/DETACH concurrently thread is perhaps touching close
> to here, but I'm not sure if any of it should be concerned about how we
> generate the PartitionBoundInfo which the patch here trying to make into
> its own function?

Yeah, definitely not.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Optimze usage of immutable functions as relation

2018-11-08 Thread Antonin Houska
Kyotaro HORIGUCHI  wrote:

> I had the following error with the following query.
> 
> =# explain select * from pg_stat_get_activity(NULL) a join log(10.0) p on 
> a.pid = p.p;
> ERROR:  no relation entry for relid 2
> 

I think that the problem is that RTE_VALUES is wrapped in a subquery by the parser
while RTE_FUNCTION is not. Thus some additional processing is done by
pull_up_simple_subquery() for VALUES. What seems to make a difference here is
the call of flatten_join_alias_vars() on the query targetlist, although
pull_up_simple_subquery() does it for other reasons.

Actually the comment about flatten_join_alias_vars() in
pull_up_simple_subquery() makes me wonder whether it's o.k. that
pull_up_simple_function() sets rvcontext.need_phvs=false regardless of the strictness
of the function: I think PlaceHolderVar might be needed if the function is not
strict. (In contrast, pull_up_simple_values() does not have to care because
pull_up_simple_subquery() handles such cases when it's processing the owning
subquery).

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: Optimze usage of immutable functions as relation

2018-11-08 Thread Antonin Houska
Aleksandr Parfenov  wrote:

> I fixed a typo and some comments. Please find new version attached.

I've checked this version too.

* is_simple_stable_function()

instead of fetching HeapTuple from the syscache manually, you might want to
consider using functions from lsyscache.c (get_func_rettype, get_func_retset,
etc.), or adding a function that returns (subset of) the fields you need in a
single call.

* pull_up_simple_function():

As you assume that rte->functions is a single-item list

Assert(list_length(rte->functions) == 1);

the following iteration is not necessary:

foreach(lc, functions_list)

Also, there seems to be a lot of copy & paste from pull_up_simple_values(), so
some refactoring would make sense.


-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: PostgreSQL vs SQL/XML Standards

2018-11-08 Thread Markus Winand


> On 2018-11-6, at 15:23 , Pavel Stehule  wrote:
> 
> 
> 
> On Mon, Oct 29, 2018 at 11:45, Pavel Stehule wrote:
> 
> 
> On Mon, Oct 29, 2018 at 10:11, Pavel Stehule wrote:
> Hi
> 
> On Thu, Oct 25, 2018 at 21:47, Alvaro Herrera wrote:
> On 2018-Oct-25, Pavel Stehule wrote:
> 
> > I am thinking so I can fix some issues related to XMLTABLE. Please, send me
> > more examples and test cases.
> 
> Please see Markus Winand's patch that I referenced upthread.
> 
> here is a fix of some XMLTABLE mentioned issues.
> 
> this update allows cast boolean to numeric types from XPath expressions
> 
> Attached patch solves some cast issues mentioned by Chap. It solves issue 
> reported by Markus. I didn't use Markus's code, but it was inspiration for 
> me. I found native solution from libxml2.
> 
> Regards
> 
> Pavel

Better than my patch.

But I think the chunk in xml_xmlnodetoxmltype of my patch is still needed — in 
one way or the other (see below).

# select * from xmltable('*' PASSING 'pre&deeppost' COLUMNS x XML PATH 'node()');
x
-
 prec1arg&ent1&deeppost
(1 row)

Output is not the original XML.

I dug a little further and found another case that doesn't look right even
with my change to xml_xmlnodetoxmltype applied:

# select * from xmltable('*' PASSING 'pre&deeppost' COLUMNS x XML PATH '/');
 x
---
 pre&ent1&deeppost
(1 row)

Oracle gives in both cases XML.

To fix that I included XML_DOCUMENT_NODE in the list of nodes that use 
xmlNodeDump. Now I wonder if that logic should be reversed to use the 
xmlXPathCastNodeToString branch in a few selected cases but default to the 
branch xmlNodeDump for all other cases?

I guess those few cases might be XML_ATTRIBUTE_NODE and XML_TEXT_NODE. 
Regression tests are happy with that approach but I don’t think that proves a 
lot.

-markus

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 37d85f7..7c1f884 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3682,7 +3682,7 @@ xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext 
*xmlerrcxt)
 {
xmltype*result;

-   if (cur->type == XML_ELEMENT_NODE)
+   if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)
{
xmlBufferPtr buf;
xmlNodePtr  cur_copy;





Re: [HACKERS] Surjective functional indexes

2018-11-08 Thread Konstantin Knizhnik



On 08.11.2018 15:23, Laurenz Albe wrote:

Tom Lane wrote:

I wanted to enumerate my concerns while yesterday's
events are still fresh in mind.  (Andres or Robert might have more.)

* I do not understand why this feature is on-by-default in the first
place.  It can only be a win for expression indexes that are many-to-one
mappings; for indexes that are one-to-one or few-to-one, it's a pretty
big loss.  I see no reason to assume that most expression indexes fall
into the first category.  I suggest that the design ought to be to use
this optimization only for indexes for which the user has explicitly
enabled recheck_on_update.  That would allow getting rid of the cost check
in IsProjectionFunctionalIndex, about which we'd otherwise have to have
an additional fight: I do not like its ad-hoc-ness, nor the modularity
violation (and potential circularity) involved in having the relcache call
cost_qual_eval.

That was my impression too when I had a closer look at this feature.

What about an option "hot_update_check" with values "off" (default),
"on" and "always"?

Yours,
Laurenz Albe

Before doing any other refactoring of projection indexes I want to attach a
small bug fix patch which fixes the original problem (SIGSEGV) and also
disables recheck_on_update by default.
As Laurenz suggested, I replaced the boolean recheck_on_update option with
"on", "auto" and "off" (default).


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 4df3d75..e17dd09 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -365,8 +365,9 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
  
Specifies whether to recheck a functional index value to see whether
-   we can use a HOT update or not. The default value is on for functional
-   indexes with an total expression cost less than 1000, otherwise off.
+   we can use a HOT update or not. Accepted values are "on", "auto" and "off"
+   (default). In case of "auto", recheck is done for indexes with an total expression
+   cost less than 1000.
You might decide to turn this off if you knew that a function used in
an index is unlikely to return the same value when one of the input
columns is updated and so the recheck is not worth the additional cost
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index db84da0..9fb5ebc 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -19,6 +19,7 @@
 
 #include "access/gist_private.h"
 #include "access/hash.h"
+#include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "access/reloptions.h"
@@ -131,15 +132,6 @@ static relopt_bool boolRelOpts[] =
 	},
 	{
 		{
-			"recheck_on_update",
-			"Recheck functional index expression for changed value after update",
-			RELOPT_KIND_INDEX,
-			ShareUpdateExclusiveLock	/* since only applies to later UPDATEs */
-		},
-		true
-	},
-	{
-		{
 			"security_barrier",
 			"View acts as a row security barrier",
 			RELOPT_KIND_VIEW,
@@ -448,6 +440,18 @@ static relopt_string stringRelOpts[] =
 		validateWithCheckOption,
 		NULL
 	},
+	{
+		{
+			"recheck_on_update",
+			"Recheck functional index expression for changed value after update",
+			RELOPT_KIND_INDEX,
+			ShareUpdateExclusiveLock	/* since only applies to later UPDATEs */
+		},
+		3,
+		false,
+		validateRecheckOnUpdateOption,
+		"off"
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1499,7 +1503,7 @@ index_generic_reloptions(Datum reloptions, bool validate)
 	GenericIndexOpts *idxopts;
 	relopt_value *options;
 	static const relopt_parse_elt tab[] = {
-		{"recheck_on_update", RELOPT_TYPE_BOOL, offsetof(GenericIndexOpts, recheck_on_update)}
+		{"recheck_on_update", RELOPT_TYPE_STRING, offsetof(GenericIndexOpts, recheck_on_update)}
 	};
 
 	options = parseRelOptions(reloptions, validate,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fb63471..21cdc1b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4491,6 +4491,21 @@ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
 	}
 }
 
+
+void validateRecheckOnUpdateOption(const char *value)
+{
+	if (value == NULL ||
+		(strcmp(value, "on") != 0 &&
+		 strcmp(value, "off") != 0 &&
+		 strcmp(value, "auto") != 0))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for \"recheck_on_update\" option"),
+				 errdetail("Valid values are \"on\", \"off\", and \"auto\".")));
+	}
+}
+
 /*
  * Check whether the value is unchanged after update of a projection
  * functional index. Compare the new and old values of the indexed
@@ -4546,9 +4561,10 @@ ProjIndexIsUnchanged(Relation relation, HeapTuple oldtup, HeapTuple newtup)
 }
 else 

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-08 Thread Sergei Kornilov
Hi

>>  Sure, but what are we going to achieve with that number? What
>>  information is the user going to get from that? If it can help us to
>>  ensure that it has reset the expected number of statements, then I can
>>  see a clear usage, but without that, the return value doesn't seem to
>>  have any clear purpose. So, I don't see much value in breaking
>>  compatibility.
>>
>>  Does anyone else have an opinion on this matter?
>
> This was proposed by Sergei Kornilov in
> https://postgr.es/m/3368121530260...@web21g.yandex.ru saying that "it
> would be nice" to return it. Maybe he has a use case in mind? I don't
> see one myself.
No, I have no specific use case for this. Silently removing all matching rows
and returning void is OK for me. But I still think the LOG ereport is not useful.

regards, Sergei



proposal: variadic argument support for least, greatest function

2018-11-08 Thread Pavel Stehule
Hi

We can pass variadic arguments as an array to any variadic function, but
some of our old variadic functions don't support this feature.

We cannot write

SELECT least(VARIADIC ARRAY[1,2,3]);

The attached patch adds this possibility to the least and greatest functions.
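
With the patch applied, the call above is accepted; a short sketch of the
expected behavior (results assumed for illustration, not taken from the
patch's regression tests):

SELECT least(VARIADIC ARRAY[3, 1, 2]);     -- 1
SELECT greatest(VARIADIC ARRAY[3, 1, 2]);  -- 3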

Regards

Pavel
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 885da18306..53021b1ae2 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1852,13 +1852,18 @@ ExecInitExprRec(Expr *node, ExprState *state,
 		case T_MinMaxExpr:
 			{
 MinMaxExpr *minmaxexpr = (MinMaxExpr *) node;
-int			nelems = list_length(minmaxexpr->args);
+int			nelems;
 TypeCacheEntry *typentry;
 FmgrInfo   *finfo;
 FunctionCallInfo fcinfo;
 ListCell   *lc;
 int			off;
 
+if (minmaxexpr->args)
+	nelems = list_length(minmaxexpr->args);
+else
+	nelems = 1;
+
 /* Look up the btree comparison function for the datatype */
 typentry = lookup_type_cache(minmaxexpr->minmaxtype,
 			 TYPECACHE_CMP_PROC);
@@ -1895,16 +1900,29 @@ ExecInitExprRec(Expr *node, ExprState *state,
 scratch.d.minmax.finfo = finfo;
 scratch.d.minmax.fcinfo_data = fcinfo;
 
-/* evaluate expressions into minmax->values/nulls */
-off = 0;
-foreach(lc, minmaxexpr->args)
+if (minmaxexpr->args)
 {
-	Expr	   *e = (Expr *) lfirst(lc);
+	scratch.d.minmax.variadic_arg = false;
 
-	ExecInitExprRec(e, state,
-	&scratch.d.minmax.values[off],
-	&scratch.d.minmax.nulls[off]);
-	off++;
+	/* evaluate expressions into minmax->values/nulls */
+	off = 0;
+	foreach(lc, minmaxexpr->args)
+	{
+		Expr	   *e = (Expr *) lfirst(lc);
+
+		ExecInitExprRec(e, state,
+		&scratch.d.minmax.values[off],
+		&scratch.d.minmax.nulls[off]);
+		off++;
+	}
+}
+else
+{
+	scratch.d.minmax.variadic_arg = true;
+
+	ExecInitExprRec(minmaxexpr->variadic_arg, state,
+		&scratch.d.minmax.values[0],
+		&scratch.d.minmax.nulls[0]);
 }
 
 /* and push the final comparison */
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index c08df6f162..e26487d315 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2791,6 +2791,53 @@ ExecEvalMinMax(ExprState *state, ExprEvalStep *op)
 	/* default to null result */
 	*op->resnull = true;
 
+	if (op->d.minmax.variadic_arg)
+	{
+		ArrayIterator array_iterator;
+		ArrayType  *arr;
+		Datum	value;
+		bool	isnull;
+
+		if (nulls[0])
+			return;
+
+		arr = DatumGetArrayTypeP(values[0]);
+
+		array_iterator = array_create_iterator(arr, 0, NULL);
+		while (array_iterate(array_iterator, &value, &isnull))
+		{
+			if (isnull)
+continue;
+
+			if (*op->resnull)
+			{
+/* first nonnull input */
+*op->resvalue = value;
+*op->resnull = false;
+			}
+			else
+			{
+int			cmpresult;
+
+/* apply comparison function */
+fcinfo->arg[0] = *op->resvalue;
+fcinfo->arg[1] = value;
+
+fcinfo->isnull = false;
+cmpresult = DatumGetInt32(FunctionCallInvoke(fcinfo));
+if (fcinfo->isnull) /* probably should not happen */
+	continue;
+
+if (cmpresult > 0 && operator == IS_LEAST)
+	*op->resvalue = value;
+else if (cmpresult < 0 && operator == IS_GREATEST)
+	*op->resvalue = value;
+			}
+		}
+
+		return;
+	}
+
 	for (off = 0; off < op->d.minmax.nelems; off++)
 	{
 		/* ignore NULL inputs */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e8ea59e34a..3df2d42ad6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1942,6 +1942,7 @@ _copyMinMaxExpr(const MinMaxExpr *from)
 	COPY_SCALAR_FIELD(inputcollid);
 	COPY_SCALAR_FIELD(op);
 	COPY_NODE_FIELD(args);
+	COPY_NODE_FIELD(variadic_arg);
 	COPY_LOCATION_FIELD(location);
 
 	return newnode;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 3bb91c9595..1c6fddac2a 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -634,6 +634,7 @@ _equalMinMaxExpr(const MinMaxExpr *a, const MinMaxExpr *b)
 	COMPARE_SCALAR_FIELD(inputcollid);
 	COMPARE_SCALAR_FIELD(op);
 	COMPARE_NODE_FIELD(args);
+	COMPARE_NODE_FIELD(variadic_arg);
 	COMPARE_LOCATION_FIELD(location);
 
 	return true;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index a10014f755..2fbdcd78f8 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -465,6 +465,9 @@ exprTypmod(const Node *expr)
 int32		typmod;
 ListCell   *arg;
 
+if (mexpr->variadic_arg)
+	return exprTypmod((Node *) mexpr->variadic_arg);
+
 if (exprType((Node *) linitial(mexpr->args)) != minmaxtype)
 	return -1;
 typmod = exprTypmod((Node *) linitial(mexpr->args));
@@ -2065,7 +2068,15 @@ expression_tree_walker(Node *node,
 		case T_CoalesceExpr:
 			return walk

Re: fix psql \conninfo & \connect when using hostaddr

2018-11-08 Thread Tom Lane
Alvaro Herrera  writes:
> On 2018-Nov-08, Arthur Zakirov wrote:
>> I just meant something like this (additional "{", "}" braces):

> We omit braces when there's an individual statement.  (We do add the
> braces when we have a comment atop the individual statement, though, to
> avoid pgindent from doing a stupid thing.)

For the record --- I just checked, and pgindent will not mess up code like

if (condition)
/* comment here */
do_something();

at least not as long as the comment is short enough for one line.
(If it's a multiline comment, it seems to want to put a blank line
in front of it, which is not very nice in this context.)

Visually, however, I think this is better off with braces because
it *looks* like a multi-line if-block.  The braces also make it
clear that your intent was not, say,

while (some-mutable-condition)
/* skip */ ;
do_something_else();

regards, tom lane



Re: shared-memory based stats collector

2018-11-08 Thread Tomas Vondra




On 11/8/18 12:46 PM, Kyotaro HORIGUCHI wrote:

Hello. Thank you for looking this.

At Tue, 30 Oct 2018 01:49:59 +0100, Tomas Vondra  wrote 
in <5253d750-890b-069b-031f-2a9b73e47...@2ndquadrant.com>

Hi,

I've started looking at the patch over the past few days. I don't have
any deep insights at this point, but there seems to be some sort of
issue in pgstat_update_stat. When building using gcc, I do get this
warning:

pgstat.c: In function ‘pgstat_update_stat’:
pgstat.c:648:18: warning: ‘now’ may be used uninitialized in this
function [-Wmaybe-uninitialized]
oldest_pending = now;
~~~^
PostgreSQL installation complete.


Uggh! The reason is that "last_report = now" comes later
than the surrounding code... Fixed.


When running this under valgrind, I get a couple of warnings in this
area of code - see the attached log with a small sample. Judging by
the locations I assume those are related to the same issue, but I have
not looked into that.


There were several typos/thinkos related to pointers modified from
the original variables. For example, the original code had something
like the following:

   memset(&shared_globalStats, 0, sizeof(shared_globalStats));

It was not fixed even though this patch changes the type of the
variable from PgStat_GlobalStats to (PgStat_GlobalStats *). As a
result, the major part of the variable remained uninitialized.

I re-ran this version on valgrind and I didn't see such kind of
problem. Thank you for the testing.



OK, regression tests now seem to pass without any valgrind issues.

However, quite a few extensions in contrib seem to be broken now. It
seems fixing them is as simple as including the new bestatus.h next to
pgstat.h.


I'm not sure splitting the headers like this is needed, actually. It's 
true we're replacing pgstat.c with something else, but it's still 
related to stats, backing pg_stat_* system views etc. So I'd keep as 
much of the definitions in pgstat.h, so that it's enough to include that 
one header file. That would "unbreak" the extensions.


Renaming pgstat_report_* functions to bestatus_report_* seems 
unnecessary to me too. The original names seem quite fine to me.


BTW the updated patches no longer apply cleanly. Apparently it got 
broken since Tuesday, most likely by the pread/pwrite patch.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [PATCH] Generic type subscripting

2018-11-08 Thread Dmitry Dolgov
> On Thu, 8 Nov 2018 at 06:14, Pavel Stehule  wrote:
>
>> Now it's my turn to disagree. As an argument I have this thread [1], where
>> a similar discussion happened about the flexibility of jsonb and throwing
>> errors (in this particular case, whether or not to throw an error when a
>> non-existing path was given to jsonb_set).
>
> That doesn't mean it is designed well.
>>
>> I can imagine a significant number of use cases where adding a value to jsonb
>> like that is the desirable outcome, and I'm not sure if I can come up with an
>> example where strictness is the best result. Maybe if you have something in
>> mind, you can describe what the case for that would be? Also, as I've mentioned
>> before, consistency between jsonb_set and the jsonb subscripting operator will
>> help us avoid tons of questions about why I can do this and that using one
>> option, but not the other.
>
> I have only one argument - with this behavior nobody knows whether the value
> was appended or updated.

Well, maybe you're right, and I would love to discuss our approach to
modifying jsonb values, but the point is that the purpose of this patch is
to provide a new "friendly" syntax to do so, not to change how it works or
provide an alternative version of the update functionality.

Even if you convince me that subscripting for jsonb should behave
differently from jsonb_set, I would suggest doing this in a separate patch
set, since the current one is already too big (probably that's why the
review process is so slow).



Re: Disallow setting client_min_messages > ERROR?

2018-11-08 Thread Tom Lane
Andres Freund  writes:
> On 2018-11-06 11:37:40 -0500, Tom Lane wrote:
>> Andres Freund  writes:
>>> Seems reasonable. I do think it's probably sensible to backpatch,
>>> although I wonder if we shouldn't clamp the value to ERROR at log
>>> emission error time, rather than via guc.c, so we don't prevent old code
>>> / postgresql.conf that set client_min_messages to > ERROR.

>> Hm, do you really think there is any?

> I'm not sure. But it sounds like it'd possibly slow adoption of the
> minor releases if we said "hey, make sure that you nowhere set
> client_min_messages > ERROR", even if it's not particularly meaningful
> thing to do, as it'd still imply a fair bit of work for bigger
> applications with not great standards.

OK, so the consensus seems to be that the back branches should continue
to allow you to set client_min_messages = FATAL/PANIC, but then ignore
that and act as though it were ERROR.

We could implement the clamp either in elog.c or in a GUC assignment
hook.  If we do the latter, then SHOW and pg_settings would report the
effective value rather than what you set.  That seems a bit cleaner
to me, and not without precedent.  As far as the backwards compatibility
angle goes, you can invent scenarios in which either choice could be
argued to break something; but I think the most likely avenue for
trouble is if the visible setting doesn't match the actual behavior.
So I'm leaning to the assign-hook approach; comments?
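
For illustration, from the user's side the assign-hook approach would look
roughly like this in a back branch (sketched as an assumption; the details
depend on the final patch):

SET client_min_messages = 'fatal';   -- still accepted, for compatibility
SHOW client_min_messages;            -- would report "error", the effective value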

regards, tom lane



Re: fix psql \conninfo & \connect when using hostaddr

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-08, Tom Lane wrote:

> For the record --- I just checked, and pgindent will not mess up code like
> 
>   if (condition)
>   /* comment here */
>   do_something();
> 
> at least not as long as the comment is short enough for one line.
> (If it's a multiline comment, it seems to want to put a blank line
> in front of it, which is not very nice in this context.)

Yeah, those blank lines are what I've noticed, and IMO they look pretty
bad.

> Visually, however, I think this is better off with braces because
> it *looks* like a multi-line if-block.  The braces also make it
> clear that your intent was not, say,
> 
>   while (some-mutable-condition)
>   /* skip */ ;
>   do_something_else();

Right, that too.  Fortunately I think compilers warn about mismatching
indentation nowadays, at least in some cases.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-08, Tomas Vondra wrote:

> I'm not sure splitting the headers like this is needed, actually. It's true
> we're replacing pgstat.c with something else, but it's still related to
> stats, backing pg_stat_* system views etc. So I'd keep as much of the
> definitions in pgstat.h, so that it's enough to include that one header
> file. That would "unbreak" the extensions.

pgstat.h includes a lot of other stuff that presumably isn't needed if
all some .c wants is in bestatus.h, so my vote would be to make this
change *if it's actually possible to do it*: you want the affected
headers to compile standalone (use cpluspluscheck or similar to verify
this), for one thing.

> Renaming pgstat_report_* functions to bestatus_report_* seems unnecessary to
> me too. The original names seem quite fine to me.

Yeah, this probably keeps churn to a minimum.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Disallow setting client_min_messages > ERROR?

2018-11-08 Thread Andres Freund
Hi,

On 2018-11-08 10:56:33 -0500, Tom Lane wrote:
> OK, so the consensus seems to be that the back branches should continue
> to allow you to set client_min_messages = FATAL/PANIC, but then ignore
> that and act as though it were ERROR.

Sounds good.


> We could implement the clamp either in elog.c or in a GUC assignment
> hook.  If we do the latter, then SHOW and pg_settings would report the
> effective value rather than what you set.  That seems a bit cleaner
> to me, and not without precedent.  As far as the backwards compatibility
> angle goes, you can invent scenarios in which either choice could be
> argued to break something; but I think the most likely avenue for
> trouble is if the visible setting doesn't match the actual behavior.
> So I'm leaning to the assign-hook approach; comments?

Seems reasonable.

Greetings,

Andres Freund



Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-07, Amit Langote wrote:

> I think the result in this case should be an error, just as it would in
> the regular inheritance case.
> 
> create table parent (a text);
> create table child (a text collate "en_US") inherits (parent);
> NOTICE:  merging column "a" with inherited definition
> ERROR:  column "a" has a collation conflict
> DETAIL:  "default" versus "en_US"
> 
> In fact, I see that ATTACH PARTITION rejects a partition if collations
> don't match.

Hmm, I'm thinking perhaps we shouldn't backpatch this part.  It's
obviously a bug, but we might break somebody's working apps.  Therefore
I think I'd rather leave it out of the current bugfix and commit
separately.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Disallow setting client_min_messages > ERROR?

2018-11-08 Thread Jonah H. Harris
On Thu, Nov 8, 2018 at 10:56 AM Tom Lane  wrote:

> OK, so the consensus seems to be that the back branches should continue
> to allow you to set client_min_messages = FATAL/PANIC, but then ignore
> that and act as though it were ERROR.
>

Agreed.


> We could implement the clamp either in elog.c or in a GUC assignment
> hook.  If we do the latter, then SHOW and pg_settings would report the
> effective value rather than what you set.  That seems a bit cleaner
> to me, and not without precedent.  As far as the backwards compatibility
> angle goes, you can invent scenarios in which either choice could be
> argued to break something; but I think the most likely avenue for
> trouble is if the visible setting doesn't match the actual behavior.
> So I'm leaning to the assign-hook approach; comments?
>

My patch used the check hook, but works either way.

-- 
Jonah H. Harris


Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-08 Thread Amit Langote
On Fri, Nov 9, 2018 at 1:03 AM Alvaro Herrera  wrote:
> On 2018-Nov-07, Amit Langote wrote:
>
> > I think the result in this case should be an error, just as it would in
> > the regular inheritance case.
> >
> > create table parent (a text);
> > create table child (a text collate "en_US") inherits (parent);
> > NOTICE:  merging column "a" with inherited definition
> > ERROR:  column "a" has a collation conflict
> > DETAIL:  "default" versus "en_US"
> >
> > In fact, I see that ATTACH PARTITION rejects a partition if collations
> > don't match.
>
> Hmm, I'm thinking perhaps we shouldn't backpatch this part.  It's
> obviously a bug, but we might break somebody's working apps.  Therefore
> I think I'd rather leave it out of the current bugfix and commit
> separately.

Okay, that may be fine because nothing wrong is happening by silently
ignoring the partition's specified collation.

Thanks,
Amit



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-08 Thread Robert Haas
On Wed, Nov 7, 2018 at 1:37 PM Robert Haas  wrote:
> > Maybe you could give my patch a look.
>
> I have, a bit.

While thinking about this problem a bit more, I realized that what is
called RelationBuildPartitionDesc in master and BuildPartitionDesc in
Alvaro's patch has a synchronization problem as soon as we start to
reduce lock levels.  At some point, find_inheritance_children() gets
called to get a list of the OIDs of the partitions. Then, later,
SysCacheGetAttr(RELOID, ...) gets called for each one to get its
relpartbound value.  But since catalog lookups use the most current
snapshot, they might not see a compatible view of the catalogs.

That could manifest in a few different ways:

- We might see a newer version of relpartbound, where it's now null
because it's been detached.
- We might see a newer version of relpartbound where it now has an
unrelated value because it has been detached and then reattached to
some other partitioned table.
- We might see newer versions of relpartbound for some tables than
others.  For instance, suppose we had partition A for 1..200 and B for
201..300.  Then we realize that this is not what we actually wanted to
do, so we detach A and reattach it with a bound of 1..100, and detach
B and reattach it with a bound of 101..300.  If we perform the
syscache lookup for A before this happens and the syscache lookup for
B after this happens, we might see the old bound for A and the new
bound for B, and that would be sad, 'cuz they overlap.  (A SQL sketch
of this schedule appears right after this list.)
- Seeing an older relpartbound for some other table is also a problem
for other reasons -- we will have the wrong idea about the bounds of
that partition and may put the wrong tuples into it.  Without
AccessExclusiveLock, I don't think there is anything that keeps us
from reading stale syscache entries.
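
For illustration, the overlapping-bounds schedule from the third point could
look roughly like this (table names and bounds are made up; the interleaving
with the syscache lookups is what matters):

CREATE TABLE t (x int) PARTITION BY RANGE (x);
CREATE TABLE a PARTITION OF t FOR VALUES FROM (1) TO (201);
CREATE TABLE b PARTITION OF t FOR VALUES FROM (201) TO (301);
-- another session performs the syscache lookup for a's relpartbound here ...
ALTER TABLE t DETACH PARTITION a;
ALTER TABLE t ATTACH PARTITION a FOR VALUES FROM (1) TO (101);
ALTER TABLE t DETACH PARTITION b;
ALTER TABLE t ATTACH PARTITION b FOR VALUES FROM (101) TO (301);
-- ... and the lookup for b's relpartbound here: it now believes a covers
-- 1..200 and b covers 101..300, which overlap.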

Alvaro's patch defends against the first of these cases by throwing an
error, which, as I already said, I don't think is acceptable, but I
don't see any defense at all against the other cases. The root of the
problem is that the way catalog lookups work today - each individual
lookup uses the latest available snapshot, but there is zero guarantee
that consecutive lookups use the same snapshot.  Therefore, as soon as
you start lowering lock levels, you are at risk for inconsistent data.

I suspect the only good way of fixing this problem is using a single
snapshot to perform both the scan of pg_inherits and the subsequent
pg_class lookups.  That way, you know that you are seeing the state of
the whole partitioning hierarchy as it existed at some particular
point in time -- every commit is either fully reflected in the
constructed PartitionDesc or not reflected at all.  Unfortunately,
that would mean that we can't use the syscache to perform the lookups,
which might have unhappy performance consequences.

Note that this problem still exists even if you allow concurrent
attach but not concurrent detach, but it's not as bad, because when
you encounter a concurrently-attached partition, you know it hasn't
also been concurrently-detached from someplace else.  Presumably you
either see the latest value of the partition bound or the NULL value
which preceded it, but not anything else.  If that's so, then maybe
you could get by without using a consistent snapshot for all of your
information gathering: if you see NULL, you know that the partition
was concurrently added and you just ignore it. There's still no
guarantee that all parallel workers would come to the same conclusion,
though, which doesn't feel too good.

Personally, I don't think it's right to blame that problem on parallel
query.  The problem is more general than that: we assume that holding
any kind of a lock on a relation is enough to keep the important
details of the relation static, and therefore it's fine to do
staggered lookups within one backend, and it's also fine to do
staggered lookups across different backends.  When you remove the
basic assumption that any lock is enough to prevent concurrent DDL,
then the whole idea that you can do different lookups at different
times with different snapshots (possibly in different backends) and
get sane answers also ceases to be correct.  But the idea that you can
look up different bits of catalog data at whatever time is convenient
undergirds large amounts of our current machinery -- it's built into
relcache, syscache, sinval, ...

I think that things get even crazier if we postpone locking on
individual partitions until we need to do something with that
partition, as has been proposed elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-09, Amit Langote wrote:

> On Fri, Nov 9, 2018 at 1:03 AM Alvaro Herrera  
> wrote:
> > On 2018-Nov-07, Amit Langote wrote:

> > Hmm, I'm thinking perhaps we shouldn't backpatch this part.  It's
> > obviously a bug, but we might break somebody's working apps.  Therefore
> > I think I'd rather leave it out of the current bugfix and commit
> > separately.
> 
> Okay, that may be fine because nothing wrong is happening by silently
> ignoring the partition's specified collation.

Actually, how about we reduce the message to a WARNING in released
branches, and ERROR in master?  Broken applications would continue to
work, but users might become aware that there might be a problem.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-08 Thread Magnus Hagander
On Thu, Nov 8, 2018 at 3:53 PM Sergei Kornilov  wrote:

> Hi
>
> >>  Sure, but what are we going to achieve with that number? What
> >>  information is the user going to get from that? If it can help us to
> >>  ensure that it has reset the expected number of statements, then I can
> >>  see a clear usage, but without that, the return value doesn't seem to
> >>  have any clear purpose. So, I don't see much value in breaking
> >>  compatibility.
> >>
> >>  Does anyone else have an opinion on this matter?
> >
> > This was proposed by Sergei Kornilov in
> > https://postgr.es/m/3368121530260...@web21g.yandex.ru saying that "it
> > would be nice" to return it. Maybe he has a use case in mind? I don't
> > see one myself.
> No, I have no specific use case for this. Silently removing all matching rows
> and returning void is OK for me. But I still think the LOG ereport is not useful.
>

I would much prefer it to be a return code rather than a forced LOG
message. Log message spam is very much a thing, and things that are logged
as LOG will always be there.

It could also be made to take a parameter saying log yes/no with a default
value, but that seems like possible overengineering of a fairly simple
functionality.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


security release

2018-11-08 Thread Pavel Stehule
Hi,

the update releases were published today, see
https://www.postgresql.org/about/news/1905/

If you are playing with PostgreSQL 11, or even using it in production,
don't postpone the upgrade.

Pavel


Re: security release

2018-11-08 Thread Robert Haas
On Thu, Nov 8, 2018 at 12:31 PM Pavel Stehule  wrote:
> Hi,
>
> the update releases were published today, see
> https://www.postgresql.org/about/news/1905/
>
> If you are playing with PostgreSQL 11, or even using it in production,
> don't postpone the upgrade.

Wrong list?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ON COMMIT actions and inheritance

2018-11-08 Thread Robert Haas
On Thu, Nov 8, 2018 at 4:04 AM Michael Paquier  wrote:
> Even if the style gets heavier, I have also the following in my box:
> When used on a partitioned table, this action drops its partitions and
> when used on tables with inheritance children, it drops the depending
> children.

It should be "dependent" not "depending".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: security release

2018-11-08 Thread Pavel Stehule
On Thu, Nov 8, 2018 at 18:40, Robert Haas
wrote:

> On Thu, Nov 8, 2018 at 12:31 PM Pavel Stehule 
> wrote:
> > Hi,
> >
> > the update releases were published today, see
> https://www.postgresql.org/about/news/1905/
> >
> > If you are playing with PostgreSQL 11, or even using it in production,
> don't postpone the upgrade.
>
> Wrong list?
>

yes, I am sorry.

Regards

Pavel

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: New vacuum option to do only freezing

2018-11-08 Thread Robert Haas
On Thu, Nov 8, 2018 at 4:36 AM Masahiko Sawada  wrote:
> We already have disable_page_skipping option, not (page_skipping
> false). So ISTM disable_index_cleanup would be more natural.

Sure.

> Also,
> since this option not only skips vacuuming indexes but also skips
> removing dead tuples from the heap, I think the option should have a
> name that makes it clearer to users both that it removes fewer dead
> tuples than a normal vacuum and that it is aimed at freezing tuples
> faster. Of course we will document that, though.

Well, I actually don't think that you should control two behaviors
with the same option.  If you want to vacuum and skip index cleanup,
you should say VACUUM (disable_index_cleanup).  If you want to vacuum,
disable index cleanup, and skip aggressively, you should say VACUUM
(freeze, disable_index_cleanup).  Both behaviors seem useful.
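
For illustration, the two invocations described above might look like this
(option name as proposed in this thread, not an existing VACUUM option; the
table name is made up):

VACUUM (DISABLE_INDEX_CLEANUP) mytable;
VACUUM (FREEZE, DISABLE_INDEX_CLEANUP) mytable;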

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Should new partitions inherit their tablespace from their parent?

2018-11-08 Thread Robert Haas
On Wed, Nov 7, 2018 at 9:43 PM Michael Paquier  wrote:
> On Thu, Nov 08, 2018 at 12:50:40PM +1300, David Rowley wrote:
> > How about we record the tablespace option for the partitioned table in
> > reltablespace instead of saving it as 0?  Newly created partitions
> > which don't have a TABLESPACE mentioned in the CREATE TABLE command
> > should be created in their direct parent partitioned table's
> > tablespace.
>
> I have seen enough complaints on the mailing lists regarding the way
> tablespaces are handled for partitioned tables and their partitions that
> it looks like a very good idea to make the tablespace be inherited
> automatically, by setting reltablespace to a non-zero value even if
> a partitioned table has no physical presence.  Of course not on v11 or
> older releases, just on HEAD.  It is no good to have partitioned indexes
> and partitioned tables being handled inconsistently for such things.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Postgres, fsync, and OSs (specifically linux)

2018-11-08 Thread Robert Haas
On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
 wrote:
> My plan is to do a round of testing and review of this stuff next week
> once the dust is settled on the current minor releases (including
> fixing a few typos I just spotted and some word-smithing).  All going
> well, I will then push the resulting patches to master and all
> supported stable branches, unless other reviews or objections appear.
> At some point not too far down the track I hope to be ready to
> consider committing that other patch that will completely change all
> of this code in the master branch, but in any case Craig's patch will
> get almost a full minor release cycle to sit in the stable branches
> before release.

I did a read-through of these patches.

+ new_requests = entry->requests[forknum];
+ entry->requests[forknum] =
+ bms_join(new_requests, requests);

What happens if bms_join fails, too?

+recover from the WAL after any failure is reported, preferrably

preferably.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Surjective functional indexes

2018-11-08 Thread Alvaro Herrera
On 2018-Nov-08, Konstantin Knizhnik wrote:

> Before doing any other refactoring of projection indexes I want to attach
> a small bug fix patch which fixes the original problem (SIGSEGV) and also
> disables recheck_on_update by default.
> As Laurenz suggested, I replaced the boolean recheck_on_update option with
> "on", "auto", and "off" (default).

I think this causes an ABI break for GenericIndexOpts.  Not sure to what
extent that is an actual problem (i.e. how many modules were compiled
with 11.0 that are gonna be reading that struct with later Pg), but I
think it should be avoided anyhow.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-08 Thread Robert Haas
On Thu, Nov 8, 2018 at 6:28 AM David Rowley
 wrote:
> I've attached v15 and a delta from v14 to ease re-review.
>
> I also ran pgindent on this version. That's not part of the delta but
> is in the main patch.

Did you notice 
http://postgr.es/m/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03
?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Surjective functional indexes

2018-11-08 Thread Andres Freund
Hi,

On 2018-11-07 14:25:54 -0500, Tom Lane wrote:
> We need to move forward, either by undertaking a more extensive
> clean-out, or by finding a path to a version of the code that is
> satisfactory.
> [...]
> In short, it seems likely to me that large parts of this patch need to
> be pulled out, rewritten, and then put back in different places than
> they are today.  I'm not sure if a complete revert is the best next
> step, or if we can make progress without that.

I think the feature has merit, but I don't think it makes that much
sense to start with the current in-tree version. There's just too many
architectural issues.  So I think we should clean it out as much as we
can, and then have the feature re-submitted with proper review etc.

Greetings,

Andres Freund



Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-08 Thread Alvaro Herrera
Pushed.

I included the test case for collations to the three branches, but no
code changes.  We can patch master for the handling of collations per
your patch, and if somebody has it, a change to how defaults are applied
when routing tuples.

Thanks to Jürgen for reporting the bug.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Postgres, fsync, and OSs (specifically linux)

2018-11-08 Thread Thomas Munro
On Fri, Nov 9, 2018 at 7:07 AM Robert Haas  wrote:
> On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
>  wrote:
> > My plan is to do a round of testing and review of this stuff next week
> > once the dust is settled on the current minor releases (including
> > fixing a few typos I just spotted and some word-smithing).  All going
> > well, I will then push the resulting patches to master and all
> > supported stable branches, unless other reviews or objections appear.
> > At some point not too far down the track I hope to be ready to
> > consider committing that other patch that will completely change all
> > of this code in the master branch, but in any case Craig's patch will
> > get almost a full minor release cycle to sit in the stable branches
> > before release.
>
> I did a read-through of these patches.
>
> + new_requests = entry->requests[forknum];
> + entry->requests[forknum] =
> + bms_join(new_requests, requests);
>
> What happens if bms_join fails, too?

My reasoning for choosing bms_join() is that it cannot fail, assuming
the heap is not corrupted.  It simply ORs the two bit-strings into
whichever is the longer input string, and frees the shorter input
string.  (In an earlier version I used bms_union(), this function's
non-destructive sibling, but then realised that it could fail to
allocate, causing us to lose track of a 1 bit.)

Philosophical point: if pfree() throws, then bms_join() throws, but
(assuming AllocSetFree() implementation) it can only throw if the heap
is corrupted, eg elog(ERROR, "could not find block containing chunk
%p", chunk) and possibly other errors.  Of course it's impossible to
make guarantees of any kind in case of arbitrary corruption.  But
perhaps we could do this in a critical section, so errors are promoted
to PANIC.

> +recover from the WAL after any failure is reported, preferrably
>
> preferably.

Thanks.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Postgres, fsync, and OSs (specifically linux)

2018-11-08 Thread Robert Haas
On Thu, Nov 8, 2018 at 3:04 PM Thomas Munro
 wrote:
> My reasoning for choosing bms_join() is that it cannot fail, assuming
> the heap is not corrupted.  It simply ORs the two bit-strings into
> whichever is the longer input string, and frees the shorter input
> string.  (In an earlier version I used bms_union(), this function's
> non-destructive sibling, but then realised that it could fail to
> allocate() causing us to lose track of a 1 bit).

Oh, OK.  I was assuming it was allocating.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-08 Thread David Rowley
On 9 November 2018 at 05:34, Robert Haas  wrote:
> I suspect the only good way of fixing this problem is using a single
> snapshot to perform both the scan of pg_inherits and the subsequent
> pg_class lookups.  That way, you know that you are seeing the state of
> the whole partitioning hierarchy as it existed at some particular
> point in time -- every commit is either fully reflected in the
> constructed PartitionDesc or not reflected at all.  Unfortunately,
> that would mean that we can't use the syscache to perform the lookups,
> which might have unhappy performance consequences.

I do have a patch sitting around that moves the relpartbound into a
new catalogue table named pg_partition. This gets rid of the usage of
pg_inherits for partitioned tables. I wonder if that problem is easier
to solve with that.  It also solves the issue with long partition keys
and the lack of a toast table on pg_class.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-08 Thread David Rowley
On 9 November 2018 at 00:28, David Rowley  wrote:
> I've attached v15 and a delta from v14 to ease re-review.

I just revived the 0002 patch, which is still just for testing at this
stage. There was mention over on [1] about removing the
find_all_inheritors() call.

Also some benchmarks from v14 with 0001+0002.

Setup:

DROP TABLE hashp;
CREATE TABLE hashp (a INT) PARTITION BY HASH (a);
SELECT 'CREATE TABLE hashp'||x::Text || ' PARTITION OF hashp FOR
VALUES WITH (modulus 1, remainder ' || x::text || ');' from
generate_Series(0,) x;
\gexec

(0 partitions is a non-partitioned table)

fsync=off

Partitions Patched Unpatched
0 23672 23785
10 22794 18385
100 22392 8541
1000 22419 1159
1 22195 101

[1] 
https://www.postgresql.org/message-id/ca+tgmozgjsy-nrfnzurhzqjthddh5fzjkvbvhs0byn6_46p...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v14-0002-Unsafe-locking-reduction-for-partitioned-INSERT-.patch
Description: Binary data


Re: Disallow setting client_min_messages > ERROR?

2018-11-08 Thread Tom Lane
"Jonah H. Harris"  writes:
> On Thu, Nov 8, 2018 at 10:56 AM Tom Lane  wrote:
>> We could implement the clamp either in elog.c or in a GUC assignment
>> hook.  If we do the latter, then SHOW and pg_settings would report the
>> effective value rather than what you set.  That seems a bit cleaner
>> to me, and not without precedent.  As far as the backwards compatibility
>> angle goes, you can invent scenarios in which either choice could be
>> argued to break something; but I think the most likely avenue for
>> trouble is if the visible setting doesn't match the actual behavior.
>> So I'm leaning to the assign-hook approach; comments?

> My patch used the check hook, but works either way.

I was deliberately not getting into the detail of which hook to use ;-).

Anyway, pushed with some adjustments and work on the documentation.
Notably, I thought the warning message was inappropriate and
overcomplicated, so I just dropped it.  I don't think we really need
anything there.

regards, tom lane



Re: [HACKERS] Surjective functional indexes

2018-11-08 Thread Tom Lane
Alvaro Herrera  writes:
> On 2018-Nov-08, Konstantin Knizhnik wrote:
>> Before doing any other refactoring of projection indexes I want to attach
>> a small bug fix patch which fixes the original problem (SIGSEGV) and also
>> disables recheck_on_update by default.
>> As Laurenz suggested, I replaced the boolean recheck_on_update option with
>> "on", "auto", and "off" (default).

> I think this causes an ABI break for GenericIndexOpts.  Not sure to what
> extent that is an actual problem (i.e. how many modules were compiled
> with 11.0 that are gonna be reading that struct with later Pg), but I
> think it should be avoided anyhow.

I do not see the point of having more than a boolean option anyway,
if the default is going to be "off".  If the user is going to the
trouble of marking a specific index for this feature, what else is
she likely to want other than having it "on"?

The bigger picture here, and the reason for my skepticism about having
any intelligence in the enabling logic, is that there is no scenario
in which this code can be smarter than the user about what to do.
We have no insight today, and are unlikely to have any in future, about
whether a specific index expression is many-to-one or not.  I have
exactly no faith in cost-estimate-based logic either, because of the
extremely poor quality of our function cost estimates --- very little
effort has been put into assigning trustworthy procost numbers to the
built-in functions, and as for user-defined ones, it's probably much
worse.  So that's why I'm on the warpath against the cost-based logic
that's there now; I think it's just garbage-in-garbage-out.

regards, tom lane



Re: Disallow setting client_min_messages > ERROR?

2018-11-08 Thread Jonah H. Harris
On Thu, Nov 8, 2018 at 5:37 PM Tom Lane  wrote:

> > My patch used the check hook, but works either way.
>
> I was deliberately not getting into the detail of which hook to use ;-).
>
> Anyway, pushed with some adjustments and work on the documentation.
> Notably, I thought the warning message was inappropriate and
> overcomplicated, so I just dropped it.  I don't think we really need
> anything there.
>

+1

-- 
Jonah H. Harris


Re: [HACKERS] Surjective functional indexes

2018-11-08 Thread Tom Lane
I wrote:
> The bigger picture here, and the reason for my skepticism about having
> any intelligence in the enabling logic, is that there is no scenario
> in which this code can be smarter than the user about what to do.
> We have no insight today, and are unlikely to have any in future, about
> whether a specific index expression is many-to-one or not.

Hmm ... actually, I take that back.  Since we're only interested in this
for expression indexes, we can expect that statistics will be available
for the expression index, at least for tables that have been around
long enough that UPDATE performance is really an exciting topic.
So you could imagine pulling up the stadistinct numbers for the index
column(s) and the underlying column(s), and enabling the optimization
when their ratio is less than $something.  Figuring out how to merge
numbers for multiple columns might be tricky, but it's not going to be
completely fact-free.  (I still think that the cost-estimate logic is
quite bogus, however.)
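
For example, one could imagine comparing ANALYZE's n_distinct estimates for
the expression-index column and the underlying column roughly like this
(table and index names are made up; n_distinct is negative when stored as a
fraction of the row count):

SELECT ie.n_distinct AS expr_n_distinct,
       col.n_distinct AS col_n_distinct
FROM pg_stats ie, pg_stats col
WHERE ie.tablename = 'tbl_lower_name_idx'   -- stats row for the expression index
  AND col.tablename = 'tbl'
  AND col.attname = 'name';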

Another issue in all this is the cost of doing this work over again
after any relcache flush.  Maybe we could move the responsibility
into ANALYZE?

BTW, the existing code appears to be prepared to enable this logic
if *any* index column is an expression, but surely we should do so
only if they are *all* expressions?

regards, tom lane



Re: Connection limit doesn't work for superuser

2018-11-08 Thread David G. Johnston
On Wed, Nov 7, 2018 at 11:14 AM Tom Lane  wrote:
> If people are okay with having rolconnlimit act
> differently from datconnlimit in this respect, then I'll withdraw
> my objection.

Since the rolconnlimit specifically and precisely targets the
superuser in a narrow manner it makes sense on its face to recognize
it.  That the indirect targeting of all superusers via datconnlimit is
ignored is likewise a reasonable decision.  Ignoring datconnlimit
doesn't reduce the validity of having the rolconnlimit setting be
enforced and I do not see a substantial argument for why doing so
could be harmful to a reasonably skilled operator; while it provides a
reasonable, if likely seldom used, capability that is already long
established for non-superusers.

For me the burden falls onto why rolconnlimit should not be
enforced...regardless of the fact that unenforced is status quo.  We
generally tend toward giving superusers abilities and letting them
decide how to use them and this should be no different.

David J.



ALTER DEFAULT PRIVILEGES is buggy, and so is its testing

2018-11-08 Thread Tom Lane
Buildfarm members crake and xenodermus recently fell over with
very similar symptoms in the pg_upgrade test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2018-11-07%2021%3A47%3A29

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 6350; 0 0 ACL SCHEMA 
"regress_rls_schema" andrew
pg_restore: [archiver (db)] could not execute query: ERROR:  role "33757" does 
not exist
Command was: GRANT ALL ON SCHEMA "regress_rls_schema" TO "33757";

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2018-11-08%2023%3A00%3A01

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 5357; 0 0 ACL SCHEMA 
"mvtest_mvschema" bf
pg_restore: [archiver (db)] could not execute query: ERROR:  role "33894" does 
not exist
Command was: GRANT ALL ON SCHEMA "mvtest_mvschema" TO "33894";


The bogus numeric user "names" are presumably there because aclitemout
will just fall back to printing a referenced role's OID if it fails to
look up the role's name; hence, what we're looking at here is grants
made to since-dropped roles, which somehow didn't get cleaned up.

Now, looking at the regression test scripts that create these particular
schemas, that's rather mind-boggling: they never do any grant or revoke
at all on those schemas, so how'd there come to be any ACL entries?

I think that the answer is that these particular scripts run in parallel
with the "privileges" script, in which we find this:

ALTER DEFAULT PRIVILEGES GRANT ALL ON SCHEMAS TO regress_priv_user2;

What seems to be happening is

1. Given the right concurrent timing, these schemas absorb a GRANT
to regress_priv_user2 during the window in which the above is active.

2. The GRANT apparently doesn't get entered into pg_shdepend, because
when privileges.sql drops the role, there is no complaint.  (It's easy
to reproduce this bug without any parallelism; see the sketch below.)

3. The regression tests proper don't notice the dangling ACL, but
dump/restore sure does.
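
For the record, a minimal sketch of that reproduction (one session, no
parallelism; this assumes, per point 2, that the absorbed GRANT really is
missing from pg_shdepend):

CREATE ROLE regress_priv_user2;
ALTER DEFAULT PRIVILEGES GRANT ALL ON SCHEMAS TO regress_priv_user2;
CREATE SCHEMA regress_rls_schema;   -- absorbs a default-privilege GRANT
ALTER DEFAULT PRIVILEGES REVOKE ALL ON SCHEMAS FROM regress_priv_user2;
DROP ROLE regress_priv_user2;       -- succeeds without complaint
-- the schema's ACL now references the dropped role only by OID, which is
-- what pg_dump/pg_restore then trips over.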

The reason we are just seeing this now, no doubt, is Andres having
changed the pg_upgrade test to run the core regression tests in
parallel.  I'm sure it's happened a whole lot before without anyone
noticing.

So it seems like testing ALTER DEFAULT PRIVILEGES in a script that
runs in parallel with anything else is a damfool idea.  But there
is also a live bug of failure to correctly record privileges granted
this way.  Without that bug, we'd have had our noses rubbed in the
parallel race conditions long ago.

regards, tom lane



Re: Connection slots reserved for replication

2018-11-08 Thread Masahiko Sawada
On Thu, Nov 8, 2018 at 9:30 PM Kyotaro HORIGUCHI
 wrote:
>
> Hello.
>
> At Wed, 7 Nov 2018 19:31:00 +0900, Masahiko Sawada  
> wrote in 
> > On Tue, Nov 6, 2018 at 9:16 PM Kyotaro HORIGUCHI
> >  wrote:
> > > InitializeMaxBackends()
> > > MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
> > > -   max_worker_processes;
> > > +   max_worker_processes + replication_reserved_connections;
> > >
> > > This means a walsender doesn't consume a connection, which is
> > > different from the current behavior. We should reserve a part of
> > > MaxConnections for walsenders.  (in PostmasterMain,
> > > max_wal_senders is counted as a part of MaxConnections)
> >
> > Yes. We can force replication_reserved_connections <= max_wal_senders
> > and then reserved connections for replication should be a part of
> > MaxConnections.
> >
> > >
> > > +   if (am_walsender && replication_reserved_connections < 
> > > max_wal_senders
> > > +   && *procgloballist == NULL)
> > > +   procgloballist = &ProcGlobal->freeProcs;
> > >
> > > Currently an excessive number of walsenders is rejected in
> > > InitWalSenderSlot, which emits the following error.
> > >
> > > >ereport(FATAL,
> > > >(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
> > > > errmsg("number of requested standby connections "
> > > >"exceeds max_wal_senders (currently %d)",
> > > >max_wal_senders)));
> > >
> > > With this patch, if max_wal_senders =
> > > replication_reserved_connections = 3 and the fourth walreceiver
> > > comes, we will get "FATAL: sorry, too many clients already"
> > > instead. It should be fixed.
> > >
> > > When r_r_conn = 2 and max_wal_senders = 3 and the three
> > > walsenders are active, in an extreme case where a new replication
> > > connection comes at the same time another is exiting, we could
> > > end up using two normal slots even though one of the reserved
> > > slots is vacant.
> >
> > Doesn't the max_wal_senders prevent the case?
>
> Currently the variable doesn't work that way. We first accept the
> connection request and search for a vacant slot in
> InitWalSenderSlot, rejecting the connection if no room is
> available. Even with this patch, we don't count the accurate
> number of active walsenders (for performance reasons). If the
> reserved slots are filled, there's no way other than to accept the
> connection using a non-reserved slot if r_r_conn <
> max_wal_senders. Suppose one of the active walsenders goes away
> between the time we allocate a non-reserved connection slot and
> the time InitWalSenderSlot starts searching the walsnd[] array.
> Then the new walsender on the unreserved slot is activated, and
> one reserved slot is left empty. So this is "an extreme case". We
> could ignore the case.
>
> I doubt that we should allow a setting where r_r_conn <
> max_wal_senders, or even r_r_conn != max_wal_senders. We don't
> have a problem like this if we don't allow those cases.
>
> > Wal senders can get a connection if we have more free procs than
> > (MaxConnections - reserved for superusers). So I think for normal
> > users the connection must be refused if (MaxConnections - (reserved
> > for superuser and replication) > # of freeprocs), and for wal senders
> > the connection must also be refused if (MaxConnections - (reserved for
> > superuser) > # of freeprocs). I'm not sure we need such a trick in
> > InitWalSenderSlot().
>
> (For clarity, I don't mean my previous patch is good solution.)
>
> It works as long as we accept that some reserved slots can be left
> unused even though some walsenders are using normal slots. (A
> walsender exiting while using a reserved slot causes this, but the
> slot is usually occupied by a walsender that comes later.)
>
> Another idea is to acquire a walsnd[] slot before getting a
> connection slot.

After more thought, I'm inclined to agree to reserve max_wal_senders
slots and not to have a replication_reserved_connections parameter.

For superuser_reserved_connections, it actually works so that we
always reserve slots for superusers in case the slots are almost
full, regardless of who is using the other slots, including superusers
themselves. But replication connections require different behaviour,
as they have another limit (max_wal_senders). If we have
replication_reserved_connections < max_wal_senders, it could end up
with the same issue as what was originally reported on this thread.
Therefore many users would set replication_reserved_connections =
max_wal_senders.

On the other hand, if we always reserve max_wal_senders slots, the
slots available for normal backends will decrease in the next
release, which requires users to re-configure max_connections.
But this behavior seems more natural to me than the current one, so I
think the re-configuration is acceptable for users.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATIONNTT Open Source Software Center



Re: ON COMMIT actions and inheritance

2018-11-08 Thread Michael Paquier
On Thu, Nov 08, 2018 at 12:53:18PM -0500, Robert Haas wrote:
> On Thu, Nov 8, 2018 at 4:04 AM Michael Paquier  wrote:
>> Even if the style gets heavier, I have also the following in my box:
>> When used on a partitioned table, this action drops its partitions and
>> when used on tables with inheritance children, it drops the depending
>> children.
> 
> It should be "dependent" not "depending".

Thanks for the feedback!  I have committed and back-patched, but only
down to v10, as I have noticed the following limitation on 9.6 and older
versions when cascading ON COMMIT DROP using this test case:
begin;
create temp table temp_inh_oncommit_test (a int) on commit drop;
create temp table temp_inh_oncommit_test1 ()
inherits(temp_inh_oncommit_test) on commit delete rows;
insert into temp_inh_oncommit_test1 values (1);
commit;

At commit time, there is an annoying NOTICE message which should not
show up because of the automatic cascading drop:
NOTICE:  drop cascades to table temp_inh_oncommit_test1

The issue is that PERFORM_DELETION_QUIETLY is not available back then,
and fixing this requires a bit of refactoring around reportDependentObjects(),
particularly silently changing one of its arguments, which seems risky to
introduce on a stable branch, so I am taking the safest path (this is a
static routine, but I am ready to bet that some forks would be unhappy
with the sudden change, and it may not trigger a compilation failure).
This bug is also quite old, and there have not been complaints until I
discovered it, so it does not seem worth taking any risk.
--
Michael


signature.asc
Description: PGP signature


Re: fix psql \conninfo & \connect when using hostaddr

2018-11-08 Thread Michael Paquier
On Thu, Nov 08, 2018 at 12:13:31PM -0300, Alvaro Herrera wrote:
> On 2018-Nov-08, Tom Lane wrote:
>> Visually, however, I think this is better off with braces because
>> it *looks* like a multi-line if-block.  The braces also make it
>> clear that your intent was not, say,
>> 
>>  while (some-mutable-condition)
>>  /* skip */ ;
>>  do_something_else();
> 
> Right, that too.  Fortunately I think compilers warn about mismatching
> indentation nowadays, at least in some cases.

I don't recall seeing a compiler warning about that, but I do recall
coverity complaining loudly about such things.  It is better style to
use braces if there is one line of code with a comment block in my
opinion.  And there is no need for braces if there is no comment.
--
Michael


signature.asc
Description: PGP signature


Strange corruption in psql output on mereswine

2018-11-08 Thread Thomas Munro
Hi,

The last two runs show strange but identical (?) corruption in psql
output on mereswine (Debian on ARMv7):

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-08%2009%3A10%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-07%2011%3A58%3A07

-- 
Thomas Munro
http://www.enterprisedb.com



RE: Copy data to DSA area

2018-11-08 Thread Ideriha, Takeshi
Hi, thank you for all the comments.
They're really helpful.

>From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>Sent: Wednesday, November 7, 2018 1:35 PM
>
>On Wed, Nov 7, 2018 at 3:34 PM Ideriha, Takeshi 
>
>wrote:
>> Related to my development (putting relcache and catcache onto shared
>> memory)[1],
>>
>> I have some questions about how to copy variables into shared memory, 
>> especially
>DSA area, and its implementation way.
>>
>> Under the current architecture when initializing some data, we
>> sometimes copy certain data using some specified functions
>>
>> such as CreateTupleDescCopyConstr(), datumCopy(), pstrdup() and so on.
>> These copy functions calls palloc() and allocates the
>>
>> copied data into current memory context.
>
>Yeah, I faced this problem in typcache.c, and you can see the function
>share_tupledesc() which copies TupleDesc objects into a DSA area.
>This doesn't really address the fundamental problem here though... see below.

I checked share_tupledesc(). My original motivation came from copying a
tupledesc together with its constraints, as CreateTupleDescCopyConstr() does.
But yes, as you stated here and below, when there is no need to copy the
tupledesc constraints, this method makes sense.

>
>> But on the shared memory, palloc() cannot be used. Now I'm trying to
>> use DSA (and dshash) to handle data on the shared memory
>>
>> so for example dsa_allocate() is needed instead of palloc(). I hit upon 
>> three ways to
>implementation.
>>
>> A. Copy existing functions and write equivalent DSA version copy
>> functions like CreateDSATupleDescCopyConstr(),
>>
>>datumCopyDSA(), dsa_strdup()
>>
>>In these functions the logic is same as current one but would be replaced 
>> palloc()
>with dsa_allocate().
>>
>>But this way looks too straight forward and code becomes less readable and
>maintainable.
>>
>> B. Using current functions and copy data on local memory context temporarily 
>> and
>copy it again to DSA area.
>>
>>This method looks better compared to the plan A because we don't need to 
>> write
>clone functions with copy-paste.
>>
>>But copying twice would degrade the performance.
>
>It's nice when you can construct objects directly at an address supplied by 
>the caller.
>In other words, if allocation and object initialization are two separate 
>steps, you can
>put the object anywhere you like without copying.  That could be on the stack, 
>in an
>array, inside another object, in regular heap memory, in traditional shared 
>memory, in
>a DSM segment or in DSA memory.  I asked for an alloc/init separation in the 
>Bloom
>filter code for the same reason.  But this still isn't the real problem here...

Yes, actually I tried to create a new function TupleDescCopyConstr(), which is
almost the same as TupleDescCopy() except that it also copies constraints.
This was supposed to separate allocation and initialization. But as you pointed
out below, I had to manage an object graph with pointers and ran into exactly
that difficulty.

>> C. Enhance the feature of palloc() and MemoryContext.
>>
>>This is a rough idea but, for instance, make a new global flag to tell 
>> palloc() to
>use DSA area instead of normal MemoryContext.
>>
>>MemoryContextSwitchToDSA(dsa_area *area) indicates following palloc() to
>allocate memory to DSA.
>>
>>And MemoryContextSwitchBack(dsa_area) indicates to palloc is used as 
>> normal
>one.
>>
>>MemoryContextSwitchToDSA(dsa_area);
>>
>>palloc(size);
>>
>>MemoryContextSwitchBack(dsa_area);
>>
>> Plan C seems a handy way for DSA user because people can take advantage of
>existing functions.
>
>The problem with plan C is that palloc() has to return a native pointer, but 
>in order to
>do anything useful with this memory (ie to share it) you also need to get the
>dsa_pointer, but the palloc() interface doesn't allow for that.  Any object 
>that holds a
>pointer allocated with DSA-hiding-behind-palloc() will be useless for another 
>process.

Agreed. I hadn't given this point much consideration.

>> What do you think about these ideas?
>
>The real problem is object graphs with pointers.  I solved that problem for 
>TupleDesc
>by making them *almost* entirely flat, in commit c6293249.  I say 'almost' 
>because it
>won't work for constraints or defaults, but that didn't matter for the 
>typcache.c case
>because it doesn't use those.  In other words I danced carefully around the 
>edge of
>the problem.
>
>In theory, object graphs, trees, lists etc could be implemented in a way that 
>allows for
>"flat" storage, if they can be allocated in contiguous memory and refer to 
>sub-objects
>within that space using offsets from the start of the space, and then they 
>could be used
>without having to know whether they are in DSM/DSA memory or regular memory.
>That seems like a huge project though.  Otherwise they have to hold 
>dsa_pointer, and
>deal with that in many places.  You can see this in the Parallel Hash code.  I 
>had to
>make the hash table data structure able to deal wi

Re: Cache relation sizes?

2018-11-08 Thread Edmund Horner
On Wed, 7 Nov 2018 at 11:41, Thomas Munro  wrote:
>
> Hello,
>
> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
> lot.  For example, a fully prewarmed pgbench -S transaction consists
> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto().  I think
> lseek() is probably about as cheap as a syscall can be so I doubt it
> really costs us much, but it's still a context switch and it stands
> out when tracing syscalls, especially now that all the lseek(SEEK_SET)
> calls are gone (commit c24dcd0cfd).
>
> If we had a different kind of buffer mapping system of the kind that
> Andres has described, there might be a place in shared memory that
> could track the size of the relation.  Even if we could do that, I
> wonder if it would still be better to do a kind of per-backend
> lock-free caching, like this:

On behalf of those looking after databases running over NFS (sigh!), I
think this is definitely worth exploring.  Looking at the behaviour of
my (9.4.9) server, there's an lseek(SEEK_END) for every relation
(table or index) used by a query, which is a lot of them for a heavily
partitioned database.  The lseek counts seem to be the same with
native partitions and 10.4.

As an incredibly rough benchmark, a "SELECT * FROM t ORDER BY pk LIMIT
0" on a table with 600 partitions, which builds a
MergeAppend/IndexScan plan, invokes lseek around 1200 times, and takes
600ms to return when repeated.  (It's much slower the first time,
because the backend has to open the files, and read index blocks.  I
found that increasing max_files_per_process above the number of
tables/indexes in the query made a huge difference, too!)  Testing
separately, 1200 lseeks on that NFS mount takes around 400ms.

I'm aware of other improvements since 9.4.9 that would likely improve
things (the pread/pwrite change; possibly using native partitioning
instead of inheritance), but I imagine reducing lseeks would too.

(Of course, an even better improvement is to not put your data
directory on an NFS mount (sigh).)



Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2018-11-08 Thread Michael Paquier
On Thu, Nov 08, 2018 at 04:00:26PM +0900, Michael Paquier wrote:
> Yes, I don't disagree with you and I thought about it.  Fetching the
> value from the control file is easy, doing the comparison between two
> LSNs is also simple by doing it directly with pg_lsn in the database
> (and I don't want to add math logics about LSNs as a TAP API).  Now I am
> less sure about how portable it is possible to make the read of 8 bytes
> on the page header for the last page of a relation portable across many
> architectures in perl.

Does any Perl monk have a specific idea here?  It seems to me that the amount
of special-purpose machinery which would be needed outweighs the current
simplicity of the test.
--
Michael




Re: Tid scan improvements

2018-11-08 Thread Edmund Horner
On Tue, 6 Nov 2018 at 16:40, David Rowley  wrote:
> I've been looking over 0001 to 0003. I ran out of steam before 0004.

Hi David, thanks for another big review with lots of improvements.

> I like the design of the new patch. From what I threw so far at the
> selectivity estimation code, it seems pretty good.  I also quite like
> the design in nodeTidscan.c for range scans.


> I didn't quite manage to wrap my head around the code that removes
> redundant quals from the tidquals. For example, with:
>
> postgres=# explain select * from t1 where ctid <= '(0,10)' and a = 0;
> QUERY PLAN
> --
>  Tid Scan on t1  (cost=0.00..3.19 rows=1 width=4)
>TID Cond: (ctid <= '(0,10)'::tid)
>Filter: (a = 0)
> (3 rows)
>
> and:
>
> postgres=# explain select * from t1 where ctid <= '(0,10)' or a = 20
> and ctid >= '(0,0)';
>   QUERY PLAN
> --
>  Tid Scan on t1  (cost=0.01..176.18 rows=12 width=4)
>TID Cond: ((ctid <= '(0,10)'::tid) OR (ctid >= '(0,0)'::tid))
>Filter: ((ctid <= '(0,10)'::tid) OR ((a = 20) AND (ctid >= '(0,0)'::tid)))
> (3 rows)
>
> I understand why the 2nd query didn't remove the ctid quals from the
> filter, and I understand why the first query could. I just didn't
> manage to convince myself that the code behaves correctly for all
> cases.

I agree it's not obvious.

1. We extract a set of tidquals that can be directly implemented by
the Tid scan.  This set is of the form:  "(CTID op ? AND ...) OR
(...)" (with some limitations).
2. If they happened to come verbatim from the original RestrictInfos,
then they will be found in scan_clauses, and we can remove them.
3. If they're not verbatim, i.e. the original RestrictInfos have
additional criteria that the Tid scan can't use, then tidquals won't
match anything in scan_clauses, and hence scan_clauses will be
unchanged.
4. We do a bit of extra work for the common and useful case of "(CTID
op ? AND ...)".  Since the top-level operation of the input quals is
an AND, it will typically be split into multiple RestrictInfo items.
We remove each part from scan_clauses.

> 1. I see a few instances of:
>
> #define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
> #define ItemPointerGetDatum(X) PointerGetDatum(X)
>
> in both tid.c and ginfuncs.c, and I see you have:
>
> + itemptr = (ItemPointer) DatumGetPointer(constval);
>
> Do you think it would be worth moving the macros out of tid.c and
> ginfuncs.c into postgres.h and use that macro instead?
>
> (I see the code in this file already did this, so it might not matter
> about this)

I'm not sure about this one; I think it's better as a separate
patch, since we'd also be changing ginfuncs.c.  I have left it alone for
now.

> 2. In TidCompoundRangeQualFromExpr() rlst is not really needed. You
> can just return MakeTidRangeQuals(found_quals); or return NIL.

Yup, gone.

> 3. Can you explain why this only needs to take place when list_length() == 1?
>
> /*
> * In the case of a compound qual such as "ctid > ? AND ctid < ? AND ...",
> * the various parts will have come from different RestrictInfos.  So
> * remove each part separately.
> */
> ...

I've tried to improve the comment.

> 4. Accidental change?
>
> - tidquals);
> + tidquals
> + );
>
> 5. Shouldn't this comment get changed?
>
> - * NumTidsnumber of tids in this scan
> + * NumRangesnumber of tids in this scan
>
> 6. There's no longer a field named NumTids
>
> - * TidListevaluated item pointers (array of size NumTids)
> + * TidRangesevaluated item pointers (array of size NumTids)
>
> 7. The following field is not documented in TidScanState:
>
> + bool tss_inScan;
>
> 8. Can you name this exprtype instead?
>
> + TidExprType type; /* type of op */
>
> "type" is used by Node types to indicate their type.

Yup, yup, yup, yup, yup.

> 9. It would be neater this:
>
> if (expr->opno == TIDLessOperator || expr->opno == TIDLessEqOperator)
> tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> else if (expr->opno == TIDGreaterOperator || expr->opno == 
> TIDGreaterEqOperator)
> tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> else
> tidopexpr->type = TIDEXPR_EQ;
>
> tidopexpr->exprstate = exprstate;
>
> tidopexpr->inclusive = expr->opno == TIDLessEqOperator || expr->opno
> == TIDGreaterEqOperator;
>
> as a switch: ...

Yup, I think the switch is a bit nicer.
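
For reference, the switch form being suggested comes out roughly like this
(a sketch only, assuming tidopexpr->inclusive starts out false; not the
exact code in the attached patch):

    switch (expr->opno)
    {
        case TIDLessEqOperator:
            tidopexpr->inclusive = true;
            /* FALLTHROUGH */
        case TIDLessOperator:
            tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
            break;
        case TIDGreaterEqOperator:
            tidopexpr->inclusive = true;
            /* FALLTHROUGH */
        case TIDGreaterOperator:
            tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
            break;
        default:
            tidopexpr->type = TIDEXPR_EQ;
            break;
    }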

> 10. I don't quite understand this comment:
>
> + * Create an ExprState corresponding to the value part of a TID comparison,
> + * and wrap it in a TidOpExpr.  Set the type and inclusivity of the TidOpExpr
> + * appropriately, depending on the operator and position of the its 
> arguments.
>
> I don't quite see how the code sets the inclusivity depending on the
> position of the arguments.
>
> Maybe the comment should be:
>
> + * For the given 'expr' build and return 

Re: Copy data to DSA area

2018-11-08 Thread Thomas Munro
On Fri, Nov 9, 2018 at 2:19 PM Ideriha, Takeshi
 wrote:
> From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
> >I know of 3 ideas that would make your idea C work, so that you could share
> >something as complicated as a query plan directly without having to 
> >deserialise it to
> >use it:
> >
> >1.  Allow the creation of DSA areas inside the traditional fixed memory 
> >segment
> >(instead of DSM), in a fixed-sized space reserved by the postmaster.  That 
> >is, use
> >dsa.c's ability to allocate and free memory, and possibly free a whole area 
> >at once to
> >avoid leaking memory in some cases (like MemoryContexts), but in this mode
> >dsa_pointer would be directly castable to a raw pointer.  Then you could 
> >provide a
> >regular MemoryContext interface for it, and use it via palloc(), as you 
> >said, and all the
> >code that knows how to construct lists and trees and plan nodes etc would 
> >All Just
> >Work.  It would be your plan C, and all the pointers would be usable in 
> >every process,
> >but limited in total size at start-up time.
> >
> >2.  Allow regular DSA in DSM to use raw pointers into DSM segments, by 
> >mapping
> >segments at the same address in every backend.  This involves reserving a 
> >large
> >virtual address range up front in the postmaster, and then managing the 
> >space,
> >trapping SEGV to map/unmap segments into parts of that address space as 
> >necessary
> >(instead of doing that in dsa_get_address()).  AFAIK that would work, but it 
> >seems to
> >be a bit weird to go to such lengths.  It would be a kind of home-made 
> >simulation of
> >threads.  On the other hand, that is what we're already doing in dsa.c, 
> >except more
> >slowly due to extra software address translation from funky pseudo-addresses.
> >
> >3.  Something something threads.
>
> I'm thinking of going with plan 1. Not needing to think about address translation
> seems tempting. Plan 2 (as well as plan 3) looks like a big project.

The existing function dsa_create_in_place() interface was intended to
support that, but has never been used in that way so I'm not sure what
extra problems will come up.  Here are some assorted thoughts:

* You can prevent a DSA area from creating extra DSM segments, so that
it is constrained to stay entirely in the space you give it, by
calling dsa_set_size_limit(area, size) using the same size that you
gave to dsa_create_in_place(); now you have a DSA area that manages a
single fixed-sized chunk of memory that you gave it, in your case
inside the traditional shared memory segment (but it could be
anywhere, including inside a DSM segment or another DSA area!)
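
A minimal sketch of that setup, assuming a hypothetical fixed budget and a
tranche id registered elsewhere (error handling omitted):

    #define MY_AREA_SIZE (16 * 1024 * 1024)    /* hypothetical budget */

    void       *place = ShmemAlloc(MY_AREA_SIZE);
    dsa_area   *area = dsa_create_in_place(place, MY_AREA_SIZE,
                                           my_tranche_id, NULL);

    /* same size as given above, so no extra DSM segments are ever created */
    dsa_set_size_limit(area, MY_AREA_SIZE);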

* You can probably write a MemoryContext wrapper for it, if it has
only one segment that is in the traditional shared memory segment.
You would need to do a very simple kind of address translation: the
result from palloc() needs to be base + dsa_allocate()'s result, and
the argument to pfree() needs to be subtracted from base when
dsa_free() is called.  That is a version of your idea C that should
work AFAIK.
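
A minimal sketch of that translation, with hypothetical wrapper names (the
real thing would sit behind the MemoryContext methods table, and it relies
on the area consisting of exactly one in-place segment starting at 'base'):

    static void *
    shared_area_alloc(dsa_area *area, char *base, Size size)
    {
        dsa_pointer dp = dsa_allocate(area, size);

        return (void *) (base + dp);    /* raw pointer = base + offset */
    }

    static void
    shared_area_free(dsa_area *area, char *base, void *ptr)
    {
        dsa_free(area, (dsa_pointer) ((char *) ptr - base));
    }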

* Once you have that working, you now have a new kind of resource
management problem on your hands: memory leaks will be cluster-wide
and cluster-life-time!  That's hard, because the goal is to be able to
use arbitrary code in the tree that deals with plans etc, but that
code all assumes that it can "throw" (elog()) on errors.  PostgreSQL C
is generally "garbage collected" (in a way), but in this sketch, that
doesn't work anymore: this area *never* goes out of scope and gets
cleaned up.  Generally, languages with exceptions either need garbage
collection or scoped destructors to clean up the mess, but in this
sketch we don't have that anymore...  much like allocating stuff in
TopMemoryContext, except worse because it doesn't go away when one
backend exits.

* I had some ideas about some kind of "allocation rollback" interface:
you begin an "allocation transaction", allocate a bunch of stuff
(perhaps indirectly, by calling some API that makes query plans or
whatever and is totally unaware of this stuff).  Then if there is an
error, whatever was allocated so far is freed in the usual cleanup
paths by a rollback that happens via the resource manager machinery.
If you commit, then the allocation becomes permanent.  Then you only
commit stuff that you promise not to leak (perhaps stuff that has been
added to a very carefully managed cluster-wide plan cache).  I am not
sure of the details, and this might be crazy...
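
In other words, something with roughly this shape, where every function name
is hypothetical and the final form would more likely hook into the resource
owner machinery than use an explicit PG_TRY():

    alloc_scope *scope = dsa_begin_allocation(area);

    PG_TRY();
    {
        build_shared_plan(query);      /* may allocate many objects */
        dsa_commit_allocation(scope);  /* keep everything allocated so far */
    }
    PG_CATCH();
    {
        dsa_rollback_allocation(scope);    /* free everything since begin */
        PG_RE_THROW();
    }
    PG_END_TRY();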

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Performance improvements of INSERTs to a partitioned table

2018-11-08 Thread David Rowley
On 7 November 2018 at 21:31, Kato, Sho  wrote:
> AFAIK, When CREATE INDEX on a partition and INSERT to a parent table are
> executed at the same time, this patch causes deadlock.
>
> * partitions information
>
> Partition key: RANGE (a)
> Partitions: a_1 FOR VALUES FROM (1) TO (100),
> a_2 FOR VALUES FROM (100) TO (200)
>
> T1: create index a_1_ix on a_1(a);
> T2: insert into a values(101),(1);  locking a_2 and waiting releasing a_1’s
> lock
> T1: create index a_2_ix on a_2(a); ERROR:  deadlock detected
>
> I think this situation does not mean unsafe because similar situation will
> occurs at no partitioned tables and DBMS could not prevent this situation.
>
> But, I'm not sure this behavior is correct.

That's one case that will deadlock with the 0002 patch in [1]. Another
command that takes a conflicting lock on the partition only is
TRUNCATE. Many other commands that normally take an AEL can't be run
on a partition, for example, ALTER TABLE ADD COLUMN.

The same would have deadlocked on master if you'd created the indexes
in the reverse order, but I guess the point is that today there's some
defined order in which you can create the indexes that won't cause a
deadlock, but with [1] there is no defined order, since the tables are
locked in the order that tuples are routed to them.

Robert pointed out something interesting in [2] about UPDATEs of a
partition key perhaps routing a tuple to a partition that's been
excluded by constraint exclusion, so the lock is not taken initially,
but would be taken later in ExecSetupPartitionTupleRouting(). I ran
through that scenario today and discovered it can't happen as excluded
partitions are still in the rangetable, so are still locked either in
the planner or by AcquireExecutorLocks() and the order those are
locked in is well defined from the planner. However, I believe that's
something Amit plans to change in [3], where he proposes to only lock
partitions that survive partition pruning (among many other things).

There are probably a few things we could do here:

a. Have all DDL on partitions obtain the same lock level on all
ancestor partitioned tables taking a lock on the root first and
working down to the leaf.
b. Just lock what we touch and increase the likelihood of deadlocks occurring.
c. Reject [1] and [3] because we don't want to increase the chances of
deadlocks.

I'm highly against doing 'c' since both [1] and [3] are very useful
patches which scale partitioning to work well with very large numbers
of partitions. At the moment we can just barely do half a dozen
partitions before the performance starts going south. 'a' is not that
ideal a solution either as it means that anyone doing CREATE INDEX or
TRUNCATE on a partition would conflict with queries that are querying
an unrelated part of the partitioned table.  While I don't really like
'b' either, I wonder how many people are going to hit deadlocks here.
To hit that, I believe you'd have to be doing the TRUNCATE or
CREATE INDEX inside a transaction, and to hit it where you couldn't hit
it before, you'd have had to carefully write your CREATE INDEX /
TRUNCATE statements to ensure they're executed in partition order. I'm
unsure how likely that is, but I certainly can't rule out that doing
'b' will upset someone.

Perhaps a GUC is needed to choose between 'a' and 'b'?

I'm adding Amit to this email because [3] is his and Robert because
he's recently been talking about partition locking too.

[1] 
https://www.postgresql.org/message-id/CAKJS1f-vYAHqqaU878Q4cdZYHwPcv1J_C-mG%3DCs2UwhsD6cqwg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CA%2BTgmoZGJsy-nRFnzurhZQJtHdDh5fzJKvbvhS0byN6_46pB9Q%40mail.gmail.com
[3] https://commitfest.postgresql.org/20/1778/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Removal of unnecessary CommandCounterIncrement() when doing ON COMMIT DELETE ROWS

2018-11-08 Thread Michael Paquier
Hi all,

When doing a set of ON COMMIT DELETE ROWS actions for relations, there
is a CCI happening after each truncation:
@@ -13334,10 +13334,8 @@ PreCommit_on_commit_actions(void)
* exists at truncation time.
*/
if (oids_to_truncate != NIL)
-   {
heap_truncate(oids_to_truncate);
-   CommandCounterIncrement();  /* XXX needed? */
-   }
  
This was visibly introduced by f9b5b41 back in 2002, which reworked
how ON COMMIT works.

Alvaro has mentioned that it would not be needed here:
20181106125337.o23kjdv546bu2tei@alvherre.pgsql
And I think I agree with that, because this visibly applies to
index rebuilds, but in those cases CCIs are happening locally.  So I
think that we can get rid of it, and I suggest removing it on
HEAD only, of course.

Any objections?
--
Michael
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 82989158ee..946119fa86 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13334,10 +13334,8 @@ PreCommit_on_commit_actions(void)
 	 * exists at truncation time.
 	 */
 	if (oids_to_truncate != NIL)
-	{
 		heap_truncate(oids_to_truncate);
-		CommandCounterIncrement();	/* XXX needed? */
-	}
+
 	if (oids_to_drop != NIL)
 	{
 		ObjectAddresses *targetObjects = new_object_addresses();




Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-08 Thread Amit Kapila
On Thu, Nov 8, 2018 at 10:56 PM Magnus Hagander  wrote:
>
> On Thu, Nov 8, 2018 at 3:53 PM Sergei Kornilov  wrote:
>>
>> Hi
>>
>> >>  Sure, but what are we going to achieve with that number? What
>> >>  information user is going to get by that? If it can help us to ensure
>> >>  that it has reset the expected number of statements, then I can see
>> >>  the clear usage, but without that, the return value doesn't seem to
>> >>  have any clear purpose. So, I don't see much value in breaking
>> >>  compatibility.
>> >>
>> >>  Does anyone else have an opinion on this matter?
>> >
>> > This was proposed by Sergei Kornilov in
>> > https://postgr.es/m/3368121530260...@web21g.yandex.ru saying that "it
>> > would be nice" to return it. Maybe he has an use case in mind? I don't
>> > see one myself.
>> No, i have no specific usecase for this. Silently remove all matching rows 
>> and return void is ok for me. But i still think LOG ereport is not useful.
>
>
> I would much prefer it to be a return code rather than a forced LOG message. 
> Log message spam is very much a thing, and things that are logged as LOG will 
> always be there.
>

Is any such LOG message present in the latest patch?  I agree that a
return code might be better, but no such LOG message exists there.  I
can see that a return code could be helpful for some users, but if it
doesn't already exist, why should it be a requirement for this patch?
Do you have a strong opinion about introducing a return code with this
patch?

> It could also be made to take a parameter saying log yes/no with a default 
> value, but that seems like possible overengineering of a fairly simple 
> functionality.
>

Right.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Strange corruption in psql output on mereswine

2018-11-08 Thread Tom Lane
Thomas Munro  writes:
> The last two runs show strange but identical (?) corruption in psql
> output on mereswine (Debian on ARMv7):

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-08%2009%3A10%3A14
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-07%2011%3A58%3A07

Note that the corruption is in the *expected* file, not the test result
file.  I think this must mean that mereswine's git repo has become
corrupted.  I'd recommend "rm -rf" on the repo and let the buildfarm
script pull down a fresh copy next time it runs.

regards, tom lane



Re: Cache relation sizes?

2018-11-08 Thread David Rowley
On 7 November 2018 at 11:46, Andres Freund  wrote:
> Hi,
>
> On 2018-11-07 11:40:22 +1300, Thomas Munro wrote:
>> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
>> lot.  For example, a fully prewarmed pgbench -S transaction consists
>> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto().  I think
>> lseek() is probably about as cheap as a syscall can be so I doubt it
>> really costs us much, but it's still a context switch and it stands
>> out when tracing syscalls, especially now that all the lseek(SEEK_SET)
>> calls are gone (commit c24dcd0cfd).
>
> I'd really really like to see some benchmarking before embarking on a
> more complex scheme.  I aesthetically dislike those lseeks, but ...

I agree. It would be good to see benchmarks on this first.  Those
could be as simple as some crude local backend cache that stuffs the
return value of RelationGetNumberOfBlocks() in estimate_rel_size()
into a hashtable and does not take into account the fact that it might
change. That should be okay for some read-only benchmarking.
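
Something along these lines, perhaps (a throwaway sketch for benchmarking
only: keyed by relation OID and never invalidated, so not safe for anything
beyond a read-only test):

    typedef struct RelSizeEntry
    {
        Oid         relid;      /* hash key */
        BlockNumber nblocks;    /* cached result */
    } RelSizeEntry;

    static HTAB *relsize_cache = NULL;

    static BlockNumber
    cached_nblocks(Relation rel)
    {
        RelSizeEntry *entry;
        bool        found;
        Oid         relid = RelationGetRelid(rel);

        if (relsize_cache == NULL)
        {
            HASHCTL     ctl;

            memset(&ctl, 0, sizeof(ctl));
            ctl.keysize = sizeof(Oid);
            ctl.entrysize = sizeof(RelSizeEntry);
            relsize_cache = hash_create("relation size cache", 128, &ctl,
                                        HASH_ELEM | HASH_BLOBS);
        }

        entry = hash_search(relsize_cache, &relid, HASH_ENTER, &found);
        if (!found)
            entry->nblocks = RelationGetNumberOfBlocks(rel);
        return entry->nblocks;
    }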

The partitioning case is probably a less interesting case to improve
if we get [1] as we'll no longer ask for the size of any pruned
partitions. Queries that don't prune any partitions are less likely to
notice the extra overhead of the lseek(SEEK_END) since they've
probably got more work to do elsewhere.

[1] https://commitfest.postgresql.org/20/1778/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL vs SQL/XML Standards

2018-11-08 Thread Pavel Stehule
On Thu, 8 Nov 2018 at 15:18, Markus Winand wrote:

>
> > On 2018-11-6, at 15:23 , Pavel Stehule  wrote:
> >
> >
> >
> > On Mon, 29 Oct 2018 at 11:45, Pavel Stehule <
> pavel.steh...@gmail.com> wrote:
> >
> >
> > On Mon, 29 Oct 2018 at 10:11, Pavel Stehule <
> pavel.steh...@gmail.com> wrote:
> > Hi
> >
> > On Thu, 25 Oct 2018 at 21:47, Alvaro Herrera <
> alvhe...@2ndquadrant.com> wrote:
> > On 2018-Oct-25, Pavel Stehule wrote:
> >
> > > I am thinking so I can fix some issues related to XMLTABLE. Please,
> send me
> > > more examples and test cases.
> >
> > Please see Markus Winand's patch that I referenced upthread.
> >
> > here is a fix of some XMLTABLE mentioned issues.
> >
> > this update allows cast boolean to numeric types from XPath expressions
> >
> > Attached patch solves some cast issues mentioned by Chap. It solves
> issue reported by Markus. I didn't use Markus's code, but it was
> inspiration for me. I found native solution from libxml2.
> >
> > Regards
> >
> > Pavel
>
> Better than my patch.
>
> But I think the chunk in xml_xmlnodetoxmltype of my patch is still needed
> — in one way or the other (see below).
>
> # select * from xmltable('*' PASSING 'pre arg?>&deeppost' COLUMNS x XML PATH
> 'node()');
> x
> -
>  prec1arg&ent1&deeppost
> (1 row)
>
> Output is not the original XML.
>
> I dug a little further and found another case that doesn't look right
> even with my change to xml_xmlnodetoxmltype applied:
>
> # select * from xmltable('*' PASSING 'pre arg?>&deeppost' COLUMNS x XML PATH '/');
>  x
> ---
>  pre&ent1&deeppost
> (1 row)
>
> Oracle gives in both cases XML.
>
> To fix that I included XML_DOCUMENT_NODE in the list of nodes that use
> xmlNodeDump. Now I wonder if that logic should be reversed to use the
> xmlXPathCastNodeToString branch in a few selected cases but default to the
> branch xmlNodeDump for all other cases?
>
> I guess those few cases might be XML_ATTRIBUTE_NODE and XML_TEXT_NODE.
> Regression tests are happy with that approach but I don’t think that proves
> a lot.
>
> -markus
>
> diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
> index 37d85f7..7c1f884 100644
> --- a/src/backend/utils/adt/xml.c
> +++ b/src/backend/utils/adt/xml.c
> @@ -3682,7 +3682,7 @@ xml_xmlnodetoxmltype(xmlNodePtr cur,
> PgXmlErrorContext *xmlerrcxt)
>  {
> xmltype*result;
>
> -   if (cur->type == XML_ELEMENT_NODE)
> +   if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)
> {
> xmlBufferPtr buf;
> xmlNodePtr  cur_copy;
>
>
I used your patch and appended regression tests. I checked the result against
Oracle.

Regards

Pavel
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 37d85f71f3..7bed508e2a 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3674,15 +3674,15 @@ SPI_sql_row_to_xmlelement(uint64 rownum, StringInfo result, char *tablename,
 #ifdef USE_LIBXML
 
 /*
- * Convert XML node to text (dump subtree in case of element,
- * return value otherwise)
+ * Convert XML node to text (dump subtree), for attribute and text
+ * returns escaped text.
  */
 static text *
 xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
 {
 	xmltype*result;
 
-	if (cur->type == XML_ELEMENT_NODE)
+	if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)
 	{
 		xmlBufferPtr buf;
 		xmlNodePtr	cur_copy;
@@ -4427,6 +4427,35 @@ XmlTableFetchRow(TableFuncScanState *state)
 #endif			/* not USE_LIBXML */
 }
 
+/*
+ * Copy XmlChar string to PostgreSQL memory. Ensure releasing of
+ * source xmllib string.
+ */
+static char *
+copy_and_safe_free_xmlchar(xmlChar *str)
+{
+	char *result;
+
+	if (str)
+	{
+		PG_TRY();
+		{
+			result = pstrdup((char *) str);
+		}
+		PG_CATCH();
+		{
+			xmlFree(str);
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+		xmlFree(str);
+	}
+	else
+		result = NULL;
+
+	return result;
+}
+
 /*
  * XmlTableGetValue
  *		Return the value for column number 'colnum' for the current row.  If
@@ -4490,85 +4519,72 @@ XmlTableGetValue(TableFuncScanState *state, int colnum,
 			{
 *isnull = true;
 			}
-			else if (count == 1 && typid == XMLOID)
-			{
-text	   *textstr;
-
-/* simple case, result is one value */
-textstr = xml_xmlnodetoxmltype(xpathobj->nodesetval->nodeTab[0],
-			   xtCxt->xmlerrcxt);
-cstr = text_to_cstring(textstr);
-			}
-			else if (count == 1)
+			else
 			{
-xmlChar*str;
-xmlNodePtr	node;
-
-/*
- * Most nodes (elements and even attributes) store their data
- * in children nodes. If they don't have children nodes, it
- * means that they are empty (e.g. ). Text nodes and
- * CDATA sections are an exception: they don't have children
- * but have content in the Text/CDATA node itself.
- */
-node = xpathobj->nodesetval->nodeTab[0];
-	

Re: notice processors for isolationtester

2018-11-08 Thread Haribabu Kommi
On Thu, Oct 25, 2018 at 8:35 AM Alvaro Herrera 
wrote:

> Recently while testing a patch I found it immensely useful to
> distinguish which session each WARNING message came from, when bespoke
> tests were run under isolationtester.  Current code does not show that,
> so I developed this patch adding notice processors, and then it does.
>

I reviewed and tested the patch. It works as expected.
Providing the session name helps in debugging.

Marking the patch as ready for committer.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Strange corruption in psql output on mereswine

2018-11-08 Thread Clarence Ho
Thanks for the heads-up.

Just saw that "gull" has an issue with the build too.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2018-11-09%2003%3A10%3A06

Both "gull" and "mereswine" run on the same machine. Probably wear out the
SD card. Will stop these two animals and do a disk check.

On Thu, Nov 8, 2018 at 7:36 PM Tom Lane  wrote:

> Thomas Munro  writes:
> > The last two runs show strange but identical (?) corruption in psql
> > output on mereswine (Debian on ARMv7):
>
> >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-08%2009%3A10%3A14
> >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-07%2011%3A58%3A07
>
> Note that the corruption is in the *expected* file, not the test result
> file.  I think this must mean that mereswine's git repo has become
> corrupted.  I'd recommend "rm -rf" on the repo and let the buildfarm
> script pull down a fresh copy next time it runs.
>
> regards, tom lane
>


Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-08 Thread Amit Langote
On 2018/11/09 1:38, Alvaro Herrera wrote:
> On 2018-Nov-09, Amit Langote wrote:
> 
>> On Fri, Nov 9, 2018 at 1:03 AM Alvaro Herrera  
>> wrote:
>>> On 2018-Nov-07, Amit Langote wrote:
> 
>>> Hmm, I'm thinking perhaps we shouldn't backpatch this part.  It's
>>> obviously a bug, but we might break somebody's working apps.  Therefore
>>> I think I'd rather leave it out of the current bugfix and commit
>>> separately.
>>
>> Okay, that may be fine because nothing wrong is happening by silently
>> ignoring the partition's specified collation.
> 
> Actually, how about we reduce the message to a WARNING in released
> branches, and ERROR in master?  Broken applications would continue to
> work, but users might become aware that there might be a problem.

That might work too.  Would you like me to create and post patches like
that for 10 and 11 branches?

Thanks,
Amit




Re: Resetting PGPROC atomics in ProcessInit()

2018-11-08 Thread Amit Kapila
On Thu, Nov 8, 2018 at 4:38 PM Amit Kapila  wrote:
> On Sun, Nov 4, 2018 at 6:30 AM Robert Haas  wrote:
> > On Sat, Oct 27, 2018 at 6:41 AM Andres Freund  wrote:
> > > I just noticed, while working on a patch adding things to PGPROC, that
> > > the group clearning patches for the proc array and clog reset atomics in
> > > InitProcess().
> > >
> > > I'm not a big fan of that, because it means that it's not safe to look
> > > at the atomics of backends that aren't currently in use.  Is there any
> > > reason to not instead initialize them in InitProcGlobal() and just
> > > assert in InitProcess() that they're 0?  If they're not, we'd be in deep
> > > trouble anyway, no?
> >
> > I think you are correct.  I think it would be better in general for
> > InitProcess() to Assert() rather than reinitializing.
> >
>
> Okay, changed the code as per Andres's and your suggestion.  Do you
> think the attached change makes sense? I think we should backpatch
> this.
>

For 10 and 9.6, we need a slightly different patch as the change of
group clog update went in 11.  Attached are the patches for the same,
note that there is a slight change in the commit message for the patch
written for 10 and 9.6.  I will wait for a few days (till Tuesday@8:00
AM IST) to see if somebody has any comments or wants to review before I
push.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-Fix-the-initialization-of-atomic-variables-introduce.patch


0001-Fix-the-initialization-of-atomic-variable-introduced-10.patch


Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-08 Thread Amit Langote
On 2018/11/09 4:39, Alvaro Herrera wrote:
> Pushed.

Thank you for committing.  I've closed the CF entry.

> I included the test case for collations to the three branches, but no
> code changes.  We can patch master for the handling of collations per
> your patch,

Okay, but should we back-patch it with the error demoted to a WARNING in the
back branches, as you suggest?

> and if somebody has it, a change to how defaults are applied
> when routing tuples.

I haven't written such a patch yet.  Do we want such a feature?

Thanks,
Amit





Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-08 Thread Amit Langote
On 2018/11/09 14:04, Amit Langote wrote:
> On 2018/11/09 4:39, Alvaro Herrera wrote:
>> and if somebody has it, a change to how defaults are applied
>> when routing tuples.
> 
> I haven't written such a patch yet.  Do we want such a feature?

Or is it a *bug* in tuple routing that it doesn't substitute default
values that may be defined for partitions?  It kind of looks like one if
you consider an example like this.

create table p (a int, b int) partition by list (a);
create table p1 partition of p (b not null default 1) for values in (1);
insert into p1 values (1);
table p;
 a │ b
───┼───
 1 │ 1
(1 row)

insert into p values (1);
ERROR:  null value in column "b" violates not-null constraint
DETAIL:  Failing row contains (1, null).

Thanks,
Amit




Re: using expression syntax for partition bounds

2018-11-08 Thread Amit Langote
Rebased due to change in addRangeTableEntryForRelation's API.

Thanks,
Amit
>From 2c9bd7d17abea93001c923ac200c560417cd39a1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 6 Jul 2018 14:05:22 +0900
Subject: [PATCH v6] Allow generalized expression syntax for partition bounds

Authors: Kyotaro Horiguchi, Tom Lane, Amit Langote
---
 doc/src/sgml/ref/alter_table.sgml  |   6 +-
 doc/src/sgml/ref/create_table.sgml |  16 +--
 src/backend/commands/tablecmds.c   |   9 ++
 src/backend/optimizer/util/clauses.c   |   4 +-
 src/backend/parser/gram.y  |  60 +
 src/backend/parser/parse_agg.c |  10 ++
 src/backend/parser/parse_expr.c|   5 +
 src/backend/parser/parse_func.c|   3 +
 src/backend/parser/parse_utilcmd.c | 201 ++---
 src/include/optimizer/clauses.h|   2 +
 src/include/parser/parse_node.h|   1 +
 src/include/utils/partcache.h  |   6 +
 src/test/regress/expected/create_table.out |  49 ---
 src/test/regress/sql/create_table.sql  |  16 ++-
 14 files changed, 222 insertions(+), 166 deletions(-)

diff --git a/doc/src/sgml/ref/alter_table.sgml 
b/doc/src/sgml/ref/alter_table.sgml
index f13a6cd944..7ec9dc4537 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -87,9 +87,9 @@ ALTER TABLE [ IF EXISTS ] name
 
 and partition_bound_spec 
is:
 
-IN ( { numeric_literal | 
string_literal | TRUE | FALSE | 
NULL } [, ...] ) |
-FROM ( { numeric_literal | 
string_literal | TRUE | FALSE | 
MINVALUE | MAXVALUE } [, ...] )
-  TO ( { numeric_literal | 
string_literal | TRUE | FALSE | 
MINVALUE | MAXVALUE } [, ...] ) |
+IN ( partition_bound_expr [, ...] 
) |
+FROM ( { partition_bound_expr | 
MINVALUE | MAXVALUE } [, ...] )
+  TO ( { partition_bound_expr | 
MINVALUE | MAXVALUE } [, ...] ) |
 WITH ( MODULUS numeric_literal, 
REMAINDER numeric_literal )
 
 and column_constraint 
is:
diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index 4b9c8a7801..aa60f60e6a 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -86,9 +86,9 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] 
TABLE [ IF NOT EXI
 
 and partition_bound_spec 
is:
 
-IN ( { numeric_literal | 
string_literal | TRUE | FALSE | 
NULL } [, ...] ) |
-FROM ( { numeric_literal | 
string_literal | TRUE | FALSE | 
MINVALUE | MAXVALUE } [, ...] )
-  TO ( { numeric_literal | 
string_literal | TRUE | FALSE | 
MINVALUE | MAXVALUE } [, ...] ) |
+IN ( partition_bound_expr [, ...] 
) |
+FROM ( { partition_bound_expr | 
MINVALUE | MAXVALUE } [, ...] )
+  TO ( { partition_bound_expr | 
MINVALUE | MAXVALUE } [, ...] ) |
 WITH ( MODULUS numeric_literal, 
REMAINDER numeric_literal )
 
 index_parameters in 
UNIQUE, PRIMARY KEY, and 
EXCLUDE constraints are:
@@ -412,12 +412,10 @@ WITH ( MODULUS numeric_literal, REM
  
 
  
-  Each of the values specified in
-  the partition_bound_spec is
-  a literal, NULL, MINVALUE, or
-  MAXVALUE.  Each literal value must be either a
-  numeric constant that is coercible to the corresponding partition key
-  column's type, or a string literal that is valid input for that type.
+  partition_bound_expr is
+  any variable-free expression (subqueries, window functions, aggregate
+  functions, and set-returning functions are not allowed).  Its data type
+  must match the data type of the corresponding partition key column.
  
 
  
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 82989158ee..421d50bb9f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -797,6 +797,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
defaultPartOid;
Relationparent,
defaultRel = NULL;
+   RangeTblEntry *rte;
 
/* Already have strong enough lock on the parent */
parent = heap_open(parentId, NoLock);
@@ -839,6 +840,14 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
pstate = make_parsestate(NULL);
pstate->p_sourcetext = queryString;
 
+   /*
+* Add an RTE containing this relation, so that transformExpr 
called
+* on partition bound expressions is able to report errors 
using a
+* proper context.
+*/
+   rte = addRangeTableEntryForRelation(pstate, rel, 
AccessShareLock,
+   
NULL, false, false);
+   addRTEtoQuery(pstate, rte, false, true, true);
bound = transformPartitionBound(pstate, parent, 
stmt->partbound);
 
/*
diff --git a/src/backend/optimizer/u

Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

2018-11-08 Thread Kyotaro HORIGUCHI
At Thu, 08 Nov 2018 21:52:31 +0900, Etsuro Fujita  
wrote in <5be4318f.4040...@lab.ntt.co.jp>
> (2018/11/08 10:50), Thomas Munro wrote:
> > I take back what I said earlier about false positives from other
> > pipes.  I think it's only traditional Unix programs designed for use
> > in pipelines and naive programs that let SIGPIPE terminate the
> > process.   The accepted answer here gives a good way to think about
> > it:
> >
> > https://stackoverflow.com/questions/8369506/why-does-sigpipe-exist
> 
> Thanks for the information!
>
> > A program sophisticated enough to be writing to other pipes is no
> > longer in that category and should be setting up signal dispositions
> > itself, so I agree that we should enable the default disposition and
> > ignore WTERMSIG(exit_code) == SIGPIPE, as proposed.  That is pretty
> > close to the intended purpose of that signal AFAICS.
> 
> Great!
>
> >>> In the sense of "We don't care the reason", negligible reasons
> >>> are necessariry restricted to SIGPIPE, evan SIGSEGV could be
> >>> theoretically ignored safely. "theoretically" here means it is
> >>> another issue whether we rely on the output from a program which
> >>> causes SEGV (or any reason other than SIGPIPE, which we caused).
> >>
> >> For the SIGSEGV case, I think it would be better that we don't rely on
> >> the output data, IMO, because I think there might be a possibility
> >> that
> >> the program have generated that data incorrectly/unexpectedly.
> >
> > +1
> >
> > I don't think we should ignore termination by signals other than
> > SIGPIPE: that could hide serious problems from users.  I want to know
> > if my program is crashing with SIGBUS, SIGTERM, SIGFPE etc, even if it
> > happens after we read enough data; there is a major problem that a
> > human needs to investigate!
> 
> I think so too.

OK, I can live with that, no problem.
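
So the special case we are agreeing on boils down to something like this
(a sketch only, not the actual patch; 'rc' is the raw wait status and
'command' the program string):

    /* We stopped reading early and closed the pipe ourselves. */
    if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGPIPE)
    {
        /* expected: the program died from our early close, so ignore it */
    }
    else if (rc != 0)
        ereport(ERROR,
                (errmsg("program \"%s\" failed", command),
                 errdetail_internal("%s", wait_result_to_str(rc))));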

> >>> As the result it doesn't report an error for SELECT * FROM ft2
> >>> LIMIT 1 on "main(void){puts("test1"); return 1;}".
> >>>
> >>> =#  select * from ft limit 1;
> >>>  a
> >>> ---
> >>>test1
> >>> (1 row)
> >>>
> >>> limit 2 reports the error.
> >>>
> >>> =# select * from ft limit 2;
> >>> ERROR:  program "/home/horiguti/work/exprog" failed
> >>> DETAIL:  child process exited with exit code 1
> >>
> >> I think this would be contrary to users expectations: if the SELECT
> >> command works for limit 1, they would expect that the command would
> >> work
> >> for limit 2 as well.  So, I think it would be better to error out that
> >> command for limit 1 as well, as-is.
> >
> > I think it's correct that LIMIT 1 gives no error but LIMIT 2 gives an
> > error.  For LIMIT 1, we got all the rows we wanted, and then we closed
> > the pipe.  If we got a non-zero non-signal exit code, or a signal exit
> > code and it was SIGPIPE (not any other signal!), then we should
> > consider that to be expected.
> 
> Maybe I'm missing something, but the non-zero non-signal exit code
> means that there was something wrong with the called program, so I
> think a human had better investigate that as well IMO, which would
> probably be a minor problem, though.  Too restrictive?

I think Thomas is just saying that reading more lines can surface
problems.  According to the current discussion, we should error
out if we got a SEGV even with LIMIT 1.

> > I tried to think of a scenario where the already-received output is
> > truly invalidated by a later error that happens after we close the
> > pipe.  It could be something involving a program that uses something
> > optimistic like serializable snapshot isolation that can later decide
> > that whatever it told you earlier is not valid after all.  Suppose the
> > program is clever enough to expect EPIPE and not consider that to be
> > an error, but wants to tell us about serialization failure with a
> > non-zero exit code.  To handle that, you'd need a way to provide an
> > option to file_fdw to tell it not to ignore non-zero exit codes after
> > close.  This seems so exotic and contrived that it's not worth
> > bothering with for now, but could always be added.
> 
> Interesting!  I agree that such an option could add more flexibility
> in handling the non-zero-exit-code case.

I think the program shouldn't output a line until all possible
output is validated.  Once the data source has emitted a line, the
receiver can't do anything other than believe that it won't be withdrawn.

> > BTW just for curiosity:
> >
> > perl -e 'for (my $i=0; $i< 100; $i++) { print "$i\n"; }' | head -5
> > Exit code: terminated by SIGPIPE, like seq
> 
> Good to know!

Mmm... I didn't get an error at hand on either CentOS 7 or High Sierra.

| $ perl -e 'for (my $i=0; $i< 100; $i++) { print "$i\n"; }' | head -5
...
| 4
| $ echo $?
| 0

> > ruby -e 'for i in 1..100 do puts i; end' | head -5
> > Exit code: 1, like Python
> 
> Sad.  Anyway, thanks a lot for these experiments in addition to the
> previous ones!

Ruby reported a broken pipe, but the exit status was 0.

create foreign tab

Re: Performance improvements of INSERTs to a partitioned table

2018-11-08 Thread Amit Langote
On 2018/11/09 11:19, David Rowley wrote:
> On 7 November 2018 at 21:31, Kato, Sho  wrote:
>> AFAIK, When CREATE INDEX on a partition and INSERT to a parent table are
>> executed at the same time, this patch causes deadlock.
>>
>> * partitions information
>>
>> Partition key: RANGE (a)
>> Partitions: a_1 FOR VALUES FROM (1) TO (100),
>> a_2 FOR VALUES FROM (100) TO (200)
>>
>> T1: create index a_1_ix on a_1(a);
>> T2: insert into a values(101),(1);  locking a_2 and waiting releasing a_1’s
>> lock
>> T1: create index a_2_ix on a_2(a); ERROR:  deadlock detected
>>
>> I think this situation does not mean unsafe because similar situation will
>> occurs at no partitioned tables and DBMS could not prevent this situation.
>>
>> But, I'm not sure this behavior is correct.
> 
> That's one case that will deadlock with the 0002 patch in [1]. Another
> command that takes a conflicting lock on the partition only is
> TRUNCATE. Many other commands that normally take an AEL can't be run
> on a partition, for example, ALTER TABLE ADD COLUMN.
>
> The same would have deadlocked on master if you'd created the indexes
> in the reverse order, but I guess the point is that today there's some
> defined order that you can create the indexes in that won't cause a
> deadlock, but with [1] there is no defined order since the tables are
> locked in order that tuples are routed to them.

We cannot control the order in which a user performs operations directly
on partitions.  What we do have to worry about is the order in which
partitions are locked by a particular piece of backend code when a certain
operation is applied to the parent table.  However, if the user had
applied the operation to the parent table instead, then it would've
blocked for any concurrent queries that already had a lock on the parent,
or vice versa, because obviously the table named in the command is locked
before any of its children.

> Robert pointed out something interesting in [2] about UPDATEs of a
> partition key perhaps routing a tuple to a partition that's been
> excluded by constraint exclusion, so the lock is not taken initially, but
> would be taken later in ExecSetupPartitionTupleRouting(). I ran
> through that scenario today and discovered it can't happen as excluded
> partitions are still in the rangetable, so are still locked either in
> the planner or by AcquireExecutorLocks() and the order those are
> locked in is well defined from the planner.

Yes, expand_inherited_rtentry() would lock *all* partitions even if some
of them might not end up in the final plan due to some being excluded.

> However, I believe that's
> something Amit plans to change in [3], where he proposes to only lock
> partitions that survive partition pruning (among many other things).

With my patch, while the planner would take the locks in a deterministic
order on the partitions that it adds to the plan, the executor-time
locking order is non-deterministic, considering that tuple routing may
encounter multiple not-yet-seen partitions.

> There are probably a few things we could do here:
> 
> a. Have all DDL on partitions obtain the same lock level on all
> ancestor partitioned tables taking a lock on the root first and
> working down to the leaf.
> b. Just lock what we touch and increase the likelihood of deadlocks occurring.
> c. Reject [1] and [3] because we don't want to increase the chances of
> deadlocks.
>
> I'm highly against doing 'c' since both [1] and [3] are very useful
> patches which scale partitioning to work well with very large numbers
> of partitions. At the moment we can just barely do half a dozen
> partitions before the performance starts going south.

Agreed.

> 'a' is not that
> ideal a solution either as it means that anyone doing CREATE INDEX or
> TRUNCATE on a partition would conflict with queries that are querying
> an unrelated part of the partitioned table.

As you pointed out, many of the commands that require locks that would
conflict with concurrent queries are not allowed to run directly on
partitions anyway.  For operations that *are* allowed, doing 'a' is the same
as asking the user to perform the operation on the root parent instead, at
least as far as concurrency is concerned (the user may not want to
*truncate* all partitions, only some).

> While I don't really like
> 'b' either, I wonder how many people are going to hit deadlocks here.
> To hit that, I believe that you'd have to be doing the TRUNCATE or
> CREATE INDEX inside a transaction and to hit it where you couldn't hit
> it before you'd have had to carefully written your CREATE INDEX /
> TRUNCATE statements to ensure they're executed in partition order. I'm
> unsure how likely that is, but I certainly can't rule out that doing
> 'b' won't upset anyone.

As long as queries involve tuple routing that touches multiple not yet
seen partitions, someone doing conflicting operations directly on multiple
partitions in a transaction will have to be ready to see deadlocks.
Maybe, we

Re: csv format for psql

2018-11-08 Thread Michael Paquier
On Thu, Nov 08, 2018 at 01:58:34PM +0900, Michael Paquier wrote:
> Anyway, I am still going through the patch, so no need to send a new
> version for now.

Okay, I have done a round of more in-depth review, and the patch looks
to be in pretty good shape.

Relying on tuples_only to decide if the header should be printed or not
looks good to me.

+/* check for value being non-empty and with an MB length of 1 */
+if (*value == '\0' || value[PQmblen(value, pset.encoding)] != '\0')

It seems to me that this can just be replaced with that:
if (strlen(value) != 1)

Attached is what I am finishing up with for the moment.  Comments are
welcome.  I am still planning to look at that stuff a bit more,
particularly the printing part, but I am lacking time for now.
--
Michael
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index a1ca94057b..6d45f25aee 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -151,6 +151,21 @@ EOF
   
 
 
+
+  --csv
+  
+  
+  
+   CSV
+   in psql
+  
+  Switches to CSV (Comma Separated Values)
+  table output mode. This is equivalent to
+  \pset format csv.
+  
+  
+
+
 
   -d dbname
   --dbname=dbname
@@ -2556,6 +2571,19 @@ lo_import 152801
   
   
 
+  
+  fieldsep_csv
+  
+  
+  Specifies the field separator to be used in the
+  CSV format. When the separator appears in a field
+  value, that field is output inside double quotes according to
+  CSV rules. To set a tab as field separator, type
+  \pset fieldsep_csv '\t'. The default is a comma.
+  
+  
+  
+
   
   fieldsep_zero
   
@@ -2585,7 +2613,8 @@ lo_import 152801
   
   
   Sets the output format to one of aligned,
-  asciidoc, html,
+  asciidoc, csv,
+  html,
   latex (uses tabular),
   latex-longtable, troff-ms,
   unaligned, or wrapped.
@@ -2604,6 +2633,21 @@ lo_import 152801
   nicely formatted text output;  this is the default.
   
 
+ csv format writes columns separated by
+ commas, applying the quoting rules described in
+ https://tools.ietf.org/html/rfc4180";>RFC 4180.
+ Alternative separators can be selected with
+ \pset fieldsep_csv.
+ The output is compatible with the CSV format of the
+ COPY command. The header with column names
+ is generated unless the tuples_only parameter is
+ on. Title and footers are not printed.
+ Each row is terminated by the system-dependent end-of-line character,
+ which is typically a single newline (\n) for
+ Unix-like systems or a carriage return and newline sequence
+ (\r\n) for Microsoft Windows.
+ 
+
   wrapped format is like aligned but wraps
   wide data values across lines to make the output fit in the target
   column width.  The target width is determined as described under
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 0dea54d3ce..7e3f577897 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1941,8 +1941,8 @@ exec_command_pset(PsqlScanState scan_state, bool active_branch)
 
 			int			i;
 			static const char *const my_list[] = {
-"border", "columns", "expanded", "fieldsep", "fieldsep_zero",
-"footer", "format", "linestyle", "null",
+"border", "columns", "expanded", "fieldsep", "fieldsep_csv",
+"fieldsep_zero", "footer", "format", "linestyle", "null",
 "numericlocale", "pager", "pager_min_lines",
 "recordsep", "recordsep_zero",
 "tableattr", "title", "tuples_only",
@@ -3566,6 +3566,9 @@ _align2string(enum printFormat in)
 		case PRINT_ASCIIDOC:
 			return "asciidoc";
 			break;
+		case PRINT_CSV:
+			return "csv";
+			break;
 		case PRINT_HTML:
 			return "html";
 			break;
@@ -3643,6 +3646,8 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
 			popt->topt.format = PRINT_ALIGNED;
 		else if (pg_strncasecmp("asciidoc", value, vallen) == 0)
 			popt->topt.format = PRINT_ASCIIDOC;
+		else if (pg_strncasecmp("csv", value, vallen) == 0)
+			popt->topt.format = PRINT_CSV;
 		else if (pg_strncasecmp("html", value, vallen) == 0)
 			popt->topt.format = PRINT_HTML;
 		else if (pg_strncasecmp("latex", value, vallen) == 0)
@@ -3657,7 +3662,7 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
 			popt->topt.format = PRINT_WRAPPED;
 		else
 		{
-			psql_error("\\pset: allowed formats are aligned, asciidoc, html, latex, latex-longtable, troff-ms, unaligned, wrapped\n");
+			psql_error("\\pset: allowed formats are aligned, asciidoc, csv, html, latex, latex-longtable, troff-ms, unaligned, wrapped\n");
 			return false;
 		}
 	}
@@ -3785,6 +3790,26 @@ do_pset(const char 

Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-08 Thread Amit Langote
On 2018/11/08 20:28, David Rowley wrote:
> On 8 November 2018 at 20:15, Amit Langote  
> wrote:
>> Actually, as I also proposed upthread, we should move root_tuple_slot from
>> PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
>> it's part of the first step described above that has nothing to do with
>> partition tuple routing proper.  We can make PartitionTupleRouting private
>> to execPartition.c if we do that.
> 
> okay. Makes sense. I've changed things around to PartitionTupleRouting
> is now private to execPartition.c

Thank you.  I have a comment regarding how you chose to make
PartitionTupleRouting private.

Using the v14_to_v15 diff, I could quickly see that there are many diffs
changing PartitionTupleRouting to struct PartitionTupleRouting, but they
would be unnecessary if you had added the following in execPartition.h, as
my upthread patch had done.

-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
 typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;

Thanks,
Amit





Re: csv format for psql

2018-11-08 Thread Pavel Stehule
On Fri, 9 Nov 2018 at 6:57, Michael Paquier wrote:

> On Thu, Nov 08, 2018 at 01:58:34PM +0900, Michael Paquier wrote:
> > Anyway, I am still going through the patch, so no need to send a new
> > version for now.
>
> Okay, I have done a round of more in-depth review, and the patch looks
> to be in pretty good shape.
>
> Relying on tuples_only to decide if the header should be printed or not
> looks good to me.
>
> +/* check for value being non-empty and with an MB length of 1 */
> +if (*value == '\0' || value[PQmblen(value, pset.encoding)] != '\0')
>
> It seems to me that this can just be replaced with that:
> if (strlen(value) != 1)
>
> Attached is what I am finishing up with for the moment.  Comments are
> welcome.  I am still planning look at that stuff a bit more once again,
> particularly the printing part, but I am lacking of time now..
>

Looks OK.

Pavel

--
> Michael
>


Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

2018-11-08 Thread Thomas Munro
On Fri, Nov 9, 2018 at 6:39 PM Kyotaro HORIGUCHI
 wrote:
> Mmm..I didn't get an error at hand on both CentOS7 and High Sierra.
>
> | $ perl -e 'for (my $i=0; $i< 100; $i++) { print "$i\n"; }' | head -5
> ...
> | 4
> | $ echo $?
> | 0

That's the exit code from head.  You can see python or perl's exit
code by adding strace in front (on Linux):

$ strace python -c "for i in range(100): print i" | head -5
...
+++ exited with 1 +++

$ strace perl -e 'for (my $i=0; $i< 100; $i++) { print "$i\n"; }' | head -5
...
+++ killed by SIGPIPE +++

> create foreign table ft5 (a text) server svf1 options (program 'ruby -e "for 
> i in 1..1000 do puts i; end"');
> select * from ft5 limit 5;
>  a
> ---
>  1
> ...
>  5
> (5 rows)
> (no error)

1000 is not enough... due to buffering, it works.  Try 100:

postgres=# create foreign table ft5 (a text) server svf1 options
(program 'ruby -e "for i in 1..100 do puts i; end"');
CREATE FOREIGN TABLE
postgres=# select * from ft5 limit 5;
ERROR:  program "ruby -e "for i in 1..100 do puts i; end"" failed
DETAIL:  child process exited with exit code 1

-- 
Thomas Munro
http://www.enterprisedb.com