Re: vacuumdb and new VACUUM options
On Wed, May 08, 2019 at 09:26:35AM +0900, Masahiko Sawada wrote:
> I think it's a good idea to add new options of these parameters for
> vacuumdb. While making INDEX_CLEANUP option patch I also attached the
> patch for INDEX_CLEANUP parameter before[1], although it adds
> --disable-index-cleanup option instead.

I have added an open item for that. I think that we should add these
options.
--
Michael
signature.asc Description: PGP signature
Re: copy-past-o comment in lock.h
On Tue, May 07, 2019 at 04:12:31PM +0800, John Naylor wrote: > That's probably better. Would you like to send an updated patch? Perhaps you have a better idea? -- Michael signature.asc Description: PGP signature
Inconsistency between table am callback and table function names
The general theme for table function names seems to be "table_". For example, table_scan_getnextslot() and its corresponding callback scan_getnextslot(). Most of the table functions and callbacks follow that convention, except the following ones:

table_beginscan
table_endscan
table_rescan
table_fetch_row_version
table_get_latest_tid
table_insert
table_insert_speculative
table_complete_speculative
table_delete
table_update
table_lock_tuple

The corresponding callback names for them are:

scan_begin
scan_end
scan_rescan
tuple_fetch_row_version
tuple_get_latest_tid
tuple_insert
tuple_insert_speculative
tuple_delete
tuple_update
tuple_lock

This is confusing while browsing through the code, so I would like to propose that we make them consistent. Either fix the callback names or the table function names, but all should follow the same convention; that makes them easy to browse and to refer to. Personally, I would say fix the table function names, as the callback names seem fine. So, for example, make it table_scan_begin().

Also, some of these table function names read a little oddly:

table_relation_set_new_filenode
table_relation_nontransactional_truncate
table_relation_copy_data
table_relation_copy_for_cluster
table_relation_vacuum
table_relation_estimate_size

Can we drop the word "relation" from the callback names and, as a result, from these function names as well? That is, just have callback names like set_new_filenode, copy_data, estimate_size?

Also, a question about comments. Currently, redundant comments are written above the callback functions as well as above the table functions. They sometimes differ a little in their descriptions, but the majority of the content is the same. Should we settle on a single place for these comments, to keep them in sync and to know which one to refer to?

Plus, the file name amapi.h now seems very broad; if possible it should be renamed to indexamapi.h or indexam.h to follow tableam.h. No idea what our policy is around header file renames.
Re: Inconsistent error message wording for REINDEX CONCURRENTLY
On Tue, May 07, 2019 at 05:19:38PM -0400, Tom Lane wrote:
> That's nothing but a hack, and the reason it's necessary is that
> index_create will throw error if IsCatalogRelation is true, which
> it will be for information_schema TOAST tables --- but not for their
> parent tables that are being examined here.

Oh. Good catch. That's indeed crazy now that I look closer at that.

> After looking around, it seems to me that the correct definition for
> IsCatalogRelation is just "is the OID less than FirstBootstrapObjectId?".
> Currently we could actually restrict it to "less than
> FirstGenbkiObjectId", because all the catalogs, indexes, and TOAST tables
> have hand-assigned OIDs --- but perhaps someday we'll let genbki.pl
> assign some of those OIDs, so I prefer the weaker constraint. In any
> case, this gives us a correct separation between objects that are
> traceable to the bootstrap data and those that are created by plain SQL
> later in initdb.
>
> With this, the Form_pg_class argument to IsCatalogClass becomes
> vestigial. I'm tempted to get rid of that function altogether in
> favor of direct calls to IsCatalogRelationOid, but haven't done so
> in the attached.

I think that removing IsCatalogClass() entirely is just better: if any extension uses this routine, it could potentially simplify its code, because needing a Form_pg_class usually means opening a Relation, and that could then be avoided. With IsCatalogClass() removed, the only dependency on Form_pg_class comes from IsToastClass(), which is not used at all except in IsSystemClass(). Wouldn't it be better to remove IsToastClass() entirely and switch IsSystemClass() to use a namespace OID instead of a Form_pg_class?

With your patch, ReindexRelationConcurrently() does not complain about REINDEX TABLE CONCURRENTLY on a catalog table and would trigger the error from index_create(), which is at the origin of this thread. The check with IsSharedRelation() for REINDEX INDEX CONCURRENTLY is useless, and the error message generated for IsCatalogRelationOid() still needs to be improved. Would you prefer to include those changes in your patch? Or should I work on top of what you are proposing (your patch does not include negative tests for toast indexes and tables on catalogs either)?
--
Michael
signature.asc Description: PGP signature
Re: copy-past-o comment in lock.h
On Wed, May 8, 2019 at 3:10 PM Michael Paquier wrote:
> > On Tue, May 07, 2019 at 04:12:31PM +0800, John Naylor wrote:
> > That's probably better.
>
> Would you like to send an updated patch? Perhaps you have a better
> idea?
> --
> Michael

In the attached, I've used your language, and also moved the comments closer to the code they are describing. That seems more logical and future proof.
--
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
locktag-comment-v2.patch Description: Binary data
Re: copy-past-o comment in lock.h
On Wed, May 08, 2019 at 03:59:36PM +0800, John Naylor wrote:
> In the attached, I've used your language, and also moved the comments
> closer to the code they are describing. That seems more logical and
> future proof.

Good idea to move the comments, so what you propose looks fine to me. Are there any objections?
--
Michael
signature.asc Description: PGP signature
Re: Regression test PANICs with master-standby setup on same machine
At Tue, 7 May 2019 10:55:06 +0900, Michael Paquier wrote in <20190507015506.gc1...@paquier.xyz>
> On Tue, May 07, 2019 at 10:16:54AM +0900, Kyotaro HORIGUCHI wrote:
> > The fake symlinks need correction after the data directory and
> > tablespace directory are moved. Maybe needs to call
> > CorrectSymlink() or something at startup... Or relative
> > tablespaces should be rejected on Windows?
>
> It took enough sweat and tears to have an implementation with junction
> points done correctly on Windows and we know that it works, so I am

Indeed. It is very ill documented and complex.

> not sure that we need an actual wrapper for readlink() and such for
> the backend code to replace junction points. The issue with Windows
> is that perl's symlink() is not directly available on Windows.

Ugh. If we want to run tablespace-related tests involving replication on Windows, we need to make the tests use absolute tablespace paths. Period...?

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: vacuumdb and new VACUUM options
On Wed, May 8, 2019 at 9:06 AM Michael Paquier wrote:
>
> On Wed, May 08, 2019 at 09:26:35AM +0900, Masahiko Sawada wrote:
> > I think it's a good idea to add new options of these parameters for
> > vacuumdb. While making INDEX_CLEANUP option patch I also attached the
> > patch for INDEX_CLEANUP parameter before[1], although it adds
> > --disable-index-cleanup option instead.
>
> I have added an open item for that. I think that we should add
> these options.

+1, and thanks for adding the open item!
Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid
On Mon, May 6, 2019 at 3:30 PM Thomas Munro wrote: > I put the second sentence back and tweaked it thus: s/fails/might > fail/. Maybe I'm being too pedantic here, but binding to 127.0.0.2 > works on other OSes too, as long as you've configured an interface or > alias for it (and it's not terribly uncommon to do so). Here's a > version with a commit message. Please let me know if you want to > tweak the comments further. Pushed, with a further adjustment to the comment. -- Thomas Munro https://enterprisedb.com
Re: Statistical aggregate functions are not working with PARTIAL aggregation
On Wed, May 8, 2019 at 12:09 AM Kyotaro HORIGUCHI wrote: > By the way, as mentioned above, this issue exists since 11 but > harms at 12. Is this an open item, or older bug? Sounds more like an open item to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Inconsistent error message wording for REINDEX CONCURRENTLY
Michael Paquier writes: > On Tue, May 07, 2019 at 05:19:38PM -0400, Tom Lane wrote: >> With this, the Form_pg_class argument to IsCatalogClass becomes >> vestigial. I'm tempted to get rid of that function altogether in >> favor of direct calls to IsCatalogRelationOid, but haven't done so >> in the attached. > I think that removing entirely IsCatalogClass() is just better as if > any extension uses this routine, then it could potentially simplify > its code because needing Form_pg_class means usually opening a > Relation, and this can be removed. Yeah, it's clearly easier to use without the extra argument. > With IsCatalogClass() removed, the only dependency with Form_pg_class > comes from IsToastClass() which is not used at all except in > IsSystemClass(). Wouldn't it be better to remove entirely > IsToastClass() and switch IsSystemClass() to use a namespace OID > instead of Form_pg_class? Not sure. The way it's defined has the advantage of being more independent of exactly what the implementation of the "is a toast table" check is. Also, I looked around to see if any callers could really be simplified if they only had to pass the table OID, and didn't find much; almost all of them are looking at the pg_class tuple themselves, typically to check the relkind too. So we'd not make any net savings in syscache lookups by changing IsSystemClass's API. I'm kind of inclined to leave it alone. > With your patch, ReindexRelationConcurrently() does not complain for > REINDEX TABLE CONCURRENTLY for a catalog table and would trigger the > error from index_create(), which is at the origin of this thread. The > check with IsSharedRelation() for REINDEX INDEX CONCURRENTLY is > useless and the error message generated for IsCatalogRelationOid() > still needs to be improved. Would you prefer to include those changes > in your patch? Or should I work on top of what you are proposing > (your patch does not include negative tests for toast index and > tables on catalogs either). Yes, we still need to do your patch on top of this one (or really either order would do). I think keeping them separate is good. BTW, when I was looking at this I got dissatisfied about another aspect of the wording of the relevant error messages: a lot of them are like, for example errmsg("cannot reindex concurrently this type of relation"))); While that matches the command syntax we're using, it's just horrid English grammar. Better would be errmsg("cannot reindex this type of relation concurrently"))); Can we change that while we're at it? regards, tom lane
Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?
On 2019-05-07 05:07, Michael Paquier wrote: > On Sat, May 04, 2019 at 09:59:20PM +0900, Michael Paquier wrote: >> The result should be no deadlocks happening in the two sessions >> running the reindex. I can see the deadlock easily with three psql >> sessions, running manually the queries. > > +* If the OID isn't valid, it means the index as concurrently dropped, > +* which is not a problem for us; just return normally. > Typo here s/as/is/. > > I have looked closer at the patch and the change proposed looks good > to me. committed -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Inconsistent error message wording for REINDEX CONCURRENTLY
On Wed, May 08, 2019 at 08:31:54AM -0400, Tom Lane wrote: > Michael Paquier writes: >> With IsCatalogClass() removed, the only dependency with Form_pg_class >> comes from IsToastClass() which is not used at all except in >> IsSystemClass(). Wouldn't it be better to remove entirely >> IsToastClass() and switch IsSystemClass() to use a namespace OID >> instead of Form_pg_class? > > Not sure. The way it's defined has the advantage of being more > independent of exactly what the implementation of the "is a toast table" > check is. Also, I looked around to see if any callers could really be > simplified if they only had to pass the table OID, and didn't find much; > almost all of them are looking at the pg_class tuple themselves, typically > to check the relkind too. So we'd not make any net savings in syscache > lookups by changing IsSystemClass's API. I'm kind of inclined to leave > it alone. Hmm. Okay. It would have been nice to remove completely the dependency to Form_pg_class from this set of APIs, but I can see your point. > Yes, we still need to do your patch on top of this one (or really > either order would do). I think keeping them separate is good. Okay, glad to hear. That's what I wanted to do. > While that matches the command syntax we're using, it's just horrid > English grammar. Better would be > > errmsg("cannot reindex this type of relation > concurrently"))); > > Can we change that while we're at it? No problem to do that. I'll brush up all that once you commit the first piece you have come up with, and reuse the new API of catalog.c you are introducing based on the table OID. -- Michael signature.asc Description: PGP signature
Re: accounting for memory used for BufFile during hash joins
On Tue, May 07, 2019 at 05:43:56PM -0700, Melanie Plageman wrote:

On Tue, May 7, 2019 at 6:59 AM Tomas Vondra wrote:

On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 3:15 PM Tomas Vondra > wrote:
>> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>> Switching to some other algorithm during execution moves the goal posts
>> to the next galaxy, I'm afraid.
>
>The main problem I'm aware of with sort-merge join is: not all that is
>hashable is sortable. So BNL is actually the only solution I'm aware
>of for problem B that doesn't involve changing a fundamental thing
>about PostgreSQL's data type requirements.
>

Sure, each of those algorithms has limitations. But I think that's mostly irrelevant to the main issue - switching between algorithms mid-execution. At that point some of the tuples might have already been sent to the other nodes, and I have no idea how to "resume" the tuple stream short of buffering everything locally until the join completes. And that would be rather terrible, I guess.

What if you switched to NLJ on a batch-by-batch basis and did it before starting execution of the join but after building the inner side of the hash table? That way, no tuples will have been sent to other nodes yet.

Interesting idea! I think you're right that doing it on a per-batch basis would solve that problem.

Essentially, if all (or >95%) of the tuples have the same hash value, we could switch to a special "degraded" mode doing something like a NL. At that point the hash table benefits are lost anyway, because all the tuples are in a single chain, so it's not going to be much slower.

regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, May 7, 2019 at 2:10 AM Masahiko Sawada wrote: > > > That better not be true. If you have a design where reading the WAL > > > lets you get *any* encryption key, you have a bad design, I think. > > How does the startup process decrypt WAL during recovery without > getting any encryption key if we encrypt user data in WAL by multiple > encryption keys? The keys have to be supplied from someplace outside of the database system. I am imagining a command that gets run with the key ID as an argument and is expected to print the key out on standard output for the server to read. I am not an encryption expert, but it's hard for me to imagine this working any other way. I mean, if you store the keys that you need for decryption inside the database, isn't that the same as storing your house key in your house, or your car key in your car? If you store your car key in the car, then either the car is locked from the outside, and the key is useless to you, or the car is unlocked from the outside, and the key is just as available to a thief as it is to you. Either way, it provides no security. What you do is keep your car key in your pocket or purse; if you try to start the car, it "requests" the key from you as proof that you are entitled to start it. I think the database has to work similarly, except that rather than protecting the act of "starting" the database, each key is requested the first time it's needed, when it's discovered that we need to decrypt some data encrypted with that key. > > > Well, what threat are you trying to protect against? > > Data theft bypassing PostgreSQL's ACL, for example a malicious user > thefts storage devices and reads datbase files directly. > > I'm thinking that only users who have an access privilege of the > database object can get encryption key for the object. Therefore, when > a malicious user stole an encryption key by breaking the access > control system if we suppose data at rest encryption to serve as a yet > another access control layer we have to use the same encryption key > for WAL as that we used for database file. But I thought that we > should rather protect data from that situation by access control > system and managing encryption keys more robustly. I don't really follow that logic. If the encryption keys are managed robustly enough that they cannot be stolen, then we only need one. If there is still enough risk of key theft that we care to protect against it, we can't use a dedicated key for the WAL without increasing the risk. > > > > FWIW, binary log encryption of MySQL uses different encryption key > > > > from a key used for table[1]. The key is encrypted by the master key > > > > for binary log encryption and is stored in each file headers. > > > > > > So, if you steal the master key for binary log encryption, you can > > > decrypt everything, it sounds like. > > Yes, I think so. I am not keen to copy that design. It sounds like having multiple keys in this design adds a lot of complexity without adding much security. > > > Data other than table and index data seems like it is not very > > > security-sensitive. I'm not sure we need to encrypt it at all. If we > > > do, using one key seems fine. > > Agreed. But it seems not to satisfy some user who require to encrypt > everything, which we discussed before. Agreed. I'm thinking possibly we need two different facilities. Facility #1 could be whole-database encryption: everything is encrypted with one key on a block level. 
And facility #2 could be per-table encryption: blocks for specific tables (and the related TOAST tables, indexes, and relation forks) are encrypted with specific keys and, in addition, the WAL records for those tables (and the related TOAST tables, indexes, and relation forks) are encrypted with the same key, but on a per-WAL-record level; the original WAL record would get "wrapped" by a new WAL record that just says "I am an encrypted WAL record, key ID %d, encrypted contents: %s" and you have to get the key to decrypt the contents and decrypt the real WAL record inside of it. Then you process that interior record as normal. I guess if you had both things, you'd want tables for which facility #2 was enabled to bypass facility #1, so that no relation data blocks were doubly-encrypted, to avoid the overhead. But a WAL record would be doubly-encrypted when both facilities are in use: the record would get encrypted with the per-table key, and then the blocks it got stored into would be encrypted with the cluster-wide key. > I wanted to say that if we encrypt whole database cluster by single > encryption key we would need to rebuilt the database cluster when > re-encrypt data. But if we encrypt data in tablespaces by per > tablespace encryption keys we can re-encrypt data by moving > tablespaces, without rebuilt it. Interesting. I suppose that would also be true of per-table keys. CREATE TABLE newthunk ENCRYPT WITH 'hoge' AS SELECT * FROM thunk; or something o
any suggestions to detect memory corruption
I can get the following log randomly, and I am not sure which commit caused it. I spent one day on it but failed to find the cause.

2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
2019-05-08 21:37:46.692 CST [60110] WARNING: idx: 2 problem in alloc set index info: bad single-chunk 0x2a33a78 in block 0x2a33a18, chsize: 1408, chunkLimit: 1024, chunkHeaderSize: 24, block_used: 768 request size: 2481
2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index info: found inconsistent memory block 0x2a33a18

It looks like the memory managed by the "index info" memory context is being overwritten by some other, incorrect code. I didn't change any AllocSetXXX-related code, so I think I am just using it wrong in some way.

Thanks
Re: [HACKERS] Detrimental performance impact of ringbuffers on performance
On Tue, May 7, 2019 at 4:16 PM Andres Freund wrote: > Just to attach some numbers for this. On my laptop, with a pretty fast > disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get > these results. > > [ results showing ring buffers massively hurting performance ] Links to some previous discussions: http://postgr.es/m/8737e9bddb82501da1134f021bf49...@postgrespro.ru http://postgr.es/m/CAMkU=1yV=zq8shviv5nwajv5wowovzb7bx45rgdvtxs4p6w...@mail.gmail.com > We probably can't remove the ringbuffer concept from these places, but I > think we should allow users to disable them. Forcing bulk-loads, vacuum, > analytics queries to go to the OS/disk, just because of a heuristic that > can't be disabled, yielding massive slowdowns, really sucks. The discussions to which I linked above seem to suggest that one of the big issues is that the ring buffer must be large enough that WAL flush for a buffer can complete before we go all the way around the ring and get back to the same buffer. It doesn't seem unlikely that the size necessary for that to be true has changed over the years, or even that it's different on different hardware. When I did some benchmarking in this area many years ago, I found that there as you increase the ring buffer size, performance improves for a while and then more or less levels off at a certain point. And at that point performance is not much worse than it would be with no ring buffer, but you maintain some protection against cache-trashing. Your scenario assumes that the system has no concurrent activity which will suffer as a result of blowing out the cache, but in general that's probably not true. It seems to me that it might be time to bite the bullet and add GUCs for the ring buffer sizes. Then, we could make the default sizes big enough that on normal-ish hardware the performance penalty is not too severe (like, it's measured as a percentage rather than a multiple), and we could make a 0 value disable the ring buffer altogether. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: make \d pg_toast.foo show its indices
On Tue, May 7, 2019 at 6:03 PM Stephen Frost wrote: > Alright, maybe I'm not the best representation of our user base, but I > sure type 'select oid,* from pg_class where relname = ...' with some > regularity, mostly to get the oid to then go do something else. Having > the relfilenode would be nice too, now that I think about it, and > reltuples. There's ways to get *nearly* everything that's in pg_class > and friends out of various \d incantations, but not quite everything, > which seems unfortunate. > > In any case, I can understand an argument that the code it requires is > too much to maintain for a relatively minor feature (though it hardly > seems like it would be...) or that it would be confusing or unhelpful to > users (aka "noise") much of the time, so I'll leave it to others to > comment on if they think any of these ideas be a useful addition or not. > > I just don't think we should be voting down a feature because it'd take > a bit of extra effort to make our regression tests work with it, which > is all I was intending to get at here. I think it's unjustifiable to show this in \d output. But maybe in \d+ output it could be justified, or perhaps in the \d++ which I seem to recall Alvaro proposing someplace recently. I think if we're going to show it, it should be on its own line, with a clear label, not just in the table header as you proposed. Otherwise, people won't know what it is. I suppose the work we'd need to make it work with the regression tests is no worse than the hide_tableam crock which Andres recently added. That is certainly a crock, but I can testify that it's a very useful crock for zheap development. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
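For context, the by-hand catalog lookup described above looks roughly like this; the relation name foo comes from the thread subject, and reltoastrelid is included only to show how the associated TOAST table can be found:

    SELECT oid, relfilenode, reltuples, reltoastrelid::regclass
    FROM pg_class
    WHERE relname = 'foo';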
Re: make \d pg_toast.foo show its indices
Robert Haas writes: > I think it's unjustifiable to show this in \d output. But maybe in > \d+ output it could be justified, or perhaps in the \d++ which I seem > to recall Alvaro proposing someplace recently. Yeah, if we're going to do that (show a table's toast table) I would want to bury it in \d++ or some other not-currently-used notation. regards, tom lane
Re: any suggestions to detect memory corruption
Alex writes: > I can get the following log randomly and I am not which commit caused it. > 2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index > info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18 I've had success in finding memory stomp causes fairly quickly by setting a hardware watchpoint in gdb on the affected location. Then you just let it run to see when the value changes, and check whether that's a "legit" or "not legit" modification point. The hard part of that, of course, is to know in advance where the affected location is. You may be able to make things sufficiently repeatable by doing the problem query in a fresh session each time. regards, tom lane
Re: Identity columns should own only one sequence
On Tue, 2019-05-07 at 13:06 +0900, Michael Paquier wrote: > On Fri, May 03, 2019 at 08:14:35AM +0200, Laurenz Albe wrote: > > On Thu, 2019-05-02 at 22:43 +0200, Peter Eisentraut wrote: > >> I think the proper way to address this would be to create some kind of > >> dependency between the sequence and the default. > > > > That is certainly true. But that's hard to retrofit into existing > > databases, > > so it would probably be a modification that is not backpatchable. > > And this is basically already the dependency which exists between the > sequence and the relation created with the serial column. So what's > the advantage of adding more dependencies if we already have what we > need? I still think that we should be more careful to drop the > dependency between the sequence and the relation's column if dropping > the default using it. If a DDL defines first a sequence, and then a > default expression using nextval() on a column, then no serial-related I believe we should have both: - Identity columns should only use sequences with an INTERNAL dependency, as in Peter's patch. - When a column default is dropped, remove all dependencies between the column and sequences. In the spirit of moving this along, I have attached a patch which is Peter's patch from above with a regression test. Yours, Laurenz Albe From 4280a5251320d2051ada7aa1888ba20a610efa94 Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Wed, 8 May 2019 16:42:10 +0200 Subject: [PATCH] XXX Draft with regression tests XXX --- src/backend/catalog/pg_depend.c| 26 +++--- src/backend/commands/tablecmds.c | 2 +- src/backend/parser/parse_utilcmd.c | 12 +--- src/backend/rewrite/rewriteHandler.c | 2 +- src/include/catalog/dependency.h | 2 +- src/test/regress/expected/identity.out | 5 + src/test/regress/sql/identity.sql | 7 +++ 7 files changed, 39 insertions(+), 17 deletions(-) diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c index f7caedcc02..63c94cfbe6 100644 --- a/src/backend/catalog/pg_depend.c +++ b/src/backend/catalog/pg_depend.c @@ -707,8 +707,8 @@ sequenceIsOwned(Oid seqId, char deptype, Oid *tableId, int32 *colId) * Collect a list of OIDs of all sequences owned by the specified relation, * and column if specified. */ -List * -getOwnedSequences(Oid relid, AttrNumber attnum) +static List * +getOwnedSequences_internal(Oid relid, AttrNumber attnum, char deptype) { List *result = NIL; Relation depRel; @@ -750,7 +750,8 @@ getOwnedSequences(Oid relid, AttrNumber attnum) (deprec->deptype == DEPENDENCY_AUTO || deprec->deptype == DEPENDENCY_INTERNAL) && get_rel_relkind(deprec->objid) == RELKIND_SEQUENCE) { - result = lappend_oid(result, deprec->objid); + if (!deptype || deprec->deptype == deptype) +result = lappend_oid(result, deprec->objid); } } @@ -761,18 +762,29 @@ getOwnedSequences(Oid relid, AttrNumber attnum) return result; } +List * +getOwnedSequences(Oid relid, AttrNumber attnum) +{ + return getOwnedSequences_internal(relid, attnum, 0); +} + /* - * Get owned sequence, error if not exactly one. + * Get owned identity sequence, error if not exactly one. 
*/ Oid -getOwnedSequence(Oid relid, AttrNumber attnum) +getIdentitySequence(Oid relid, AttrNumber attnum, bool missing_ok) { - List *seqlist = getOwnedSequences(relid, attnum); + List *seqlist = getOwnedSequences_internal(relid, attnum, DEPENDENCY_INTERNAL); if (list_length(seqlist) > 1) elog(ERROR, "more than one owned sequence found"); else if (list_length(seqlist) < 1) - elog(ERROR, "no owned sequence found"); + { + if (missing_ok) + return InvalidOid; + else + elog(ERROR, "no owned sequence found"); + } return linitial_oid(seqlist); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 8e4743d110..8240fec578 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -6607,7 +6607,7 @@ ATExecDropIdentity(Relation rel, const char *colName, bool missing_ok, LOCKMODE table_close(attrelation, RowExclusiveLock); /* drop the internal sequence */ - seqid = getOwnedSequence(RelationGetRelid(rel), attnum); + seqid = getIdentitySequence(RelationGetRelid(rel), attnum, false); deleteDependencyRecordsForClass(RelationRelationId, seqid, RelationRelationId, DEPENDENCY_INTERNAL); CommandCounterIncrement(); diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index 4564c0ae81..a3c5b005d5 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -1083,7 +1083,7 @@ transformTableLikeClause(CreateStmtContext *cxt, TableLikeClause *table_like_cla * find sequence owned by old column; extract sequence parameters; * build new create sequence command */ - seq_relid = getOwnedSequence(RelationGetRelid(relation), attribute->attnum); + seq_relid = getIdentit
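For illustration, a command sequence of the shape discussed in this thread, which can leave one column owning two sequences; this is only a sketch based on the discussion above, not the regression test from the attached patch:

    CREATE TABLE t (id serial);
    ALTER TABLE t ALTER COLUMN id DROP DEFAULT;
    -- the serial sequence is still owned by t.id at this point
    ALTER TABLE t ALTER COLUMN id ADD GENERATED ALWAYS AS IDENTITY;
    -- t.id now owns the old serial sequence plus the new identity sequence,
    -- which is the ambiguity getIdentitySequence() has to resolve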
Re: accounting for memory used for BufFile during hash joins
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote: On Mon, May 6, 2019 at 8:15 PM Tomas Vondra wrote: Nope, that's not how it works. It's the array of batches that gets sliced, not the batches themselves. It does slightly increase the amount of data we need to shuffle between the temp files, because we can't write the data directly to batches in "future" slices. But that amplification is capped to ~2.2x (compared to the ~1.4x in master) - I've shared some measurements in [1]. [1] https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development Cool, I misunderstood. I looked at the code again today, and, at the email thread where you measured "amplification". Oh! I hope you're not too disgusted by the code in that PoC patch ;-) In terms of how many times you write each tuple, is it accurate to say that a tuple can now be spilled three times (in the worst case) whereas, before, it could be spilled only twice? 1 - when building the inner side hashtable, tuple is spilled to a "slice" file 2 - (assuming the number of batches was increased) during execution, when a tuple belonging to a later slice's spill file is found, it is re-spilled to that slice's spill file 3 - during execution, when reading from its slice file, it is re-spilled (again) to its batch's spill file Yes, that's mostly accurate understanding. Essentially this might add one extra step of "reshuffling" from the per-slice to per-batch files. Is it correct that the max number of BufFile structs you will have is equal to the number of slices + number of batches in a slice because that is the max number of open BufFiles you would have at a time? Yes. With the caveat that we need twice that number of BufFile structs, because we need them on both sides of the join. By the way, applying v4 patch on master, in an assert build, I am tripping some asserts -- starting with Assert(!file->readOnly); in BufFileWrite Whps :-/ One thing I was a little confused by was the nbatch_inmemory member of the hashtable. The comment in ExecChooseHashTableSize says that it is determining the number of batches we can fit in memory. I thought that the problem was the amount of space taken up by the BufFile data structure itself--which is related to the number of open BufFiles you need at a time. This comment in ExecChooseHashTableSize makes it sound like you are talking about fitting more than one batch of tuples into memory at a time. I was under the impression that you could only fit one batch of tuples in memory at a time. I suppose you mean this chunk: /* * See how many batches we can fit into memory (driven mostly by size * of BufFile, with PGAlignedBlock being the largest part of that). * We need one BufFile for inner and outer side, so we count it twice * for each batch, and we stop once we exceed (work_mem/2). */ while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2 <= (work_mem * 1024L / 2)) nbatch_inmemory *= 2; Yeah, that comment is a bit confusing. What the code actually does is computing the largest "slice" of batches for which we can keep the BufFile structs in memory, without exceeding work_mem/2. Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure. So, I was stepping through the code with work_mem set to the lower bound, and in ExecHashIncreaseNumBatches, I got confused. 
hashtable->nbatch_inmemory was 2 for me, thus, nbatch_tmp was 2 so, I didn't meet this condition if (nbatch_tmp > hashtable->nbatch_inmemory) since I just set nbatch_tmp using hashtable->nbatch_inmemory So, I didn't increase the number of slices, which is what I was expecting. What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp? Ah, good catch. The condition you're refering to if (nbatch_tmp > hashtable->nbatch_inmemory) should actually be if (nbatch > hashtable->nbatch_inmemory) because the point is to initialize BufFile structs for the overflow files, and we need to do that once we cross nbatch_inmemory. And it turns out this actually causes the assert failures in regression tests, you reported earlier. It failed to initialize the overflow files in some cases, so the readOnly flag seemed to be set. Attached is an updated patch, fixing this. I tried to clarify some of the comments too, and I fixed another bug I found while running the regression tests. It's still very much a crappy PoC code, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index a6c6de78f1..105989baee 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -2689,6 +2689,8 @@ show_hash_info(HashState *hashstate, ExplainState *es)
Re: [HACKERS] Detrimental performance impact of ringbuffers on performance
On Wed, May 08, 2019 at 10:08:03AM -0400, Robert Haas wrote: On Tue, May 7, 2019 at 4:16 PM Andres Freund wrote: Just to attach some numbers for this. On my laptop, with a pretty fast disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get these results. [ results showing ring buffers massively hurting performance ] Links to some previous discussions: http://postgr.es/m/8737e9bddb82501da1134f021bf49...@postgrespro.ru http://postgr.es/m/CAMkU=1yV=zq8shviv5nwajv5wowovzb7bx45rgdvtxs4p6w...@mail.gmail.com We probably can't remove the ringbuffer concept from these places, but I think we should allow users to disable them. Forcing bulk-loads, vacuum, analytics queries to go to the OS/disk, just because of a heuristic that can't be disabled, yielding massive slowdowns, really sucks. The discussions to which I linked above seem to suggest that one of the big issues is that the ring buffer must be large enough that WAL flush for a buffer can complete before we go all the way around the ring and get back to the same buffer. It doesn't seem unlikely that the size necessary for that to be true has changed over the years, or even that it's different on different hardware. When I did some benchmarking in this area many years ago, I found that there as you increase the ring buffer size, performance improves for a while and then more or less levels off at a certain point. And at that point performance is not much worse than it would be with no ring buffer, but you maintain some protection against cache-trashing. Your scenario assumes that the system has no concurrent activity which will suffer as a result of blowing out the cache, but in general that's probably not true. It seems to me that it might be time to bite the bullet and add GUCs for the ring buffer sizes. Then, we could make the default sizes big enough that on normal-ish hardware the performance penalty is not too severe (like, it's measured as a percentage rather than a multiple), and we could make a 0 value disable the ring buffer altogether. IMO adding such GUC would be useful for testing, which is something we should probably do anyway, and then based on the results we could either keep the GUC, modify the default somehow, or do nothing. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Statistical aggregate functions are not working with PARTIAL aggregation
On Wed, May 08, 2019 at 01:09:23PM +0900, Kyotaro HORIGUCHI wrote: At Wed, 08 May 2019 13:06:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190508.130636.184826233.horiguchi.kyot...@lab.ntt.co.jp> At Tue, 07 May 2019 20:47:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190507.204728.233299873.horiguchi.kyot...@lab.ntt.co.jp> > Hello. > > At Tue, 7 May 2019 14:39:55 +0530, Rajkumar Raghuwanshi wrote in > > Hi, > > As this issue is reproducible without partition-wise aggregate also, > > changing email subject from "Statistical aggregate functions are not > > working with partitionwise aggregate " to "Statistical aggregate functions > > are not working with PARTIAL aggregation". > > > > original reported test case and discussion can be found at below link. > > https://www.postgresql.org/message-id/flat/CAKcux6%3DuZEyWyLw0N7HtR9OBc-sWEFeByEZC7t-KDf15FKxVew%40mail.gmail.com > > The immediate reason for the behavior seems that > EEOP_AGG_STRICT_INPUT_CHECK_ARGS considers non existent second > argument as null, which is out of arguments list in > trans_fcinfo->args[]. > > The invalid deserialfn_oid case in ExecBuildAggTrans, it > initializes args[1] using the second argument of the functoin > (int8pl() in the case) so the correct numTransInputs here is 1, > not 2. > > I don't understand this fully but at least the attached patch > makes the test case work correctly and this seems to be the only > case of this issue. This behavior is introduced by 69c3936a14 (in v11). At that time FunctionCallInfoData is pallioc0'ed and has fixed length members arg[6] and argnull[7]. So nulls[1] is always false even if nargs = 1 so the issue had not been revealed. After introducing a9c35cf85c (in v12) the same check is done on FunctionCallInfoData that has NullableDatum args[] of required number of elements. In that case args[1] is out of palloc'ed memory so this issue has been revealed. In a second look, I seems to me that the right thing to do here is setting numInputs instaed of numArguments to numTransInputs in combining step. By the way, as mentioned above, this issue exists since 11 but harms at 12. Is this an open item, or older bug? It is an open item - there's a section for older bugs, but considering it's harmless in 11 (at least that's my understanding from the above discussion) I've added it as a regular open item. I've linked it to a9c35cf85c, which seems to be the culprit commit. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Detrimental performance impact of ringbuffers on performance
> On 8 May 2019, at 1:16, Andres Freund wrote:
>
> We probably can't remove the ringbuffer concept from these places, but I
> think we should allow users to disable them. Forcing bulk-loads, vacuum,
> analytics queries to go to the OS/disk, just because of a heuristic that
> can't be disabled, yielding massive slowdowns, really sucks.

If we have a scan-resistant shared buffers eviction strategy [0], we will not need ring buffers at all. Are there any other reasons to have these rings?

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/89A121E3-B593-4D65-98D9-BBC210B87268%40yandex-team.ru
Re: Statistical aggregate functions are not working with PARTIAL aggregation
Don't we have a build farm animal that runs under valgrind that would have caught this?
Re: Patch: doc for pg_logical_emit_message()
On Fri, Apr 26, 2019 at 10:52 AM Matsumura, Ryo wrote: > > On Wed. Apr. 24, 2019 at 11:40 PM Masao, Fujii > wrote: > > Thank you for the comment. > I understand about REPLICATION privilege and notice my unecessary words. > I update the patch. Thanks! Pushed. Regards, -- Fujii Masao
Re: vacuumdb and new VACUUM options
On Wed, May 8, 2019 at 9:32 AM Masahiko Sawada wrote:
>
> On Wed, May 8, 2019 at 2:41 AM Fujii Masao wrote:
> >
> > Hi,
> >
> > vacuumdb command supports the corresponding options to
> > any VACUUM parameters except INDEX_CLEANUP and TRUNCATE
> > that were added recently. Should vacuumdb also support those
> > new parameters, i.e., add --index-cleanup and --truncate options
> > to the command?
>
> I think it's a good idea to add new options of these parameters for
> vacuumdb. While making INDEX_CLEANUP option patch I also attached the
> patch for INDEX_CLEANUP parameter before[1], although it adds
> --disable-index-cleanup option instead.

Regarding INDEX_CLEANUP, VACUUM now has three modes:

(1) VACUUM (INDEX_CLEANUP on) does index cleanup regardless of the vacuum_index_cleanup reloption.
(2) VACUUM (INDEX_CLEANUP off) skips index cleanup regardless of the vacuum_index_cleanup reloption.
(3) Plain VACUUM decides whether to do index cleanup according to the vacuum_index_cleanup reloption.

If no option for index cleanup is specified, the vacuumdb command should work in mode (3). IMO this is intuitive. The question is: should vacuumdb support an option for (1), i.e., add something like an --index-cleanup option? Or for (2), i.e., add something like a --disable-index-cleanup option, as your patch does? Or for both?

Regards,
--
Fujii Masao
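To make the three modes concrete, a minimal SQL sketch (the table name t is made up):

    -- (1) force index cleanup, overriding the reloption
    VACUUM (INDEX_CLEANUP on) t;

    -- (2) skip index cleanup, overriding the reloption
    VACUUM (INDEX_CLEANUP off) t;

    -- (3) plain VACUUM follows the per-table reloption
    ALTER TABLE t SET (vacuum_index_cleanup = off);
    VACUUM t;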
Re: any suggestions to detect memory corruption
On Wed, May 8, 2019 at 10:34 AM Tom Lane wrote: > Alex writes: > > I can get the following log randomly and I am not which commit caused it. > > > 2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index > > info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18 > > I've had success in finding memory stomp causes fairly quickly by setting > a hardware watchpoint in gdb on the affected location. Then you just let > it run to see when the value changes, and check whether that's a "legit" > or "not legit" modification point. > > The hard part of that, of course, is to know in advance where the affected > location is. You may be able to make things sufficiently repeatable by > doing the problem query in a fresh session each time. valgrind might also be a possibility, although that has a lot of overhead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Detrimental performance impact of ringbuffers on performance
Hi, On 2019-05-08 10:08:03 -0400, Robert Haas wrote: > On Tue, May 7, 2019 at 4:16 PM Andres Freund wrote: > > Just to attach some numbers for this. On my laptop, with a pretty fast > > disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get > > these results. > > > > [ results showing ring buffers massively hurting performance ] > > Links to some previous discussions: > > http://postgr.es/m/8737e9bddb82501da1134f021bf49...@postgrespro.ru > http://postgr.es/m/CAMkU=1yV=zq8shviv5nwajv5wowovzb7bx45rgdvtxs4p6w...@mail.gmail.com > > > We probably can't remove the ringbuffer concept from these places, but I > > think we should allow users to disable them. Forcing bulk-loads, vacuum, > > analytics queries to go to the OS/disk, just because of a heuristic that > > can't be disabled, yielding massive slowdowns, really sucks. > > The discussions to which I linked above seem to suggest that one of > the big issues is that the ring buffer must be large enough that WAL > flush for a buffer can complete before we go all the way around the > ring and get back to the same buffer. That is some of the problem, true. But even on unlogged tables the ringbuffers cause quite the massive performance deterioration. Without the ringbuffers we write twice the size of the releation (once with zeroes for the file extension, once with with the actual data). With the ringbuffer we do so two or three additional times (hint bits + normal vacuum, then freezing). On a test-cluster that replaced the smgrextend() for heap with posix_fallocate() (to avoid the unnecesary write), I measured the performance of CTAS UNLOGGED SELECT * FROM pgbench_accounts_scale_1000 with and without ringbuffers: With ringbuffers: CREATE UNLOGGED TABLE AS: Time: 67808.643 ms (01:07.809) VACUUM: Time: 53020.848 ms (00:53.021) VACUUM FREEZE: Time: 55809.247 ms (00:55.809) Without ringbuffers: CREATE UNLOGGED TABLE AS: Time: 45981.237 ms (00:45.981) VACUUM: Time: 23386.818 ms (00:23.387) VACUUM FREEZE: Time: 5892.204 ms (00:05.892) > It doesn't seem unlikely that > the size necessary for that to be true has changed over the years, or > even that it's different on different hardware. When I did some > benchmarking in this area many years ago, I found that there as you > increase the ring buffer size, performance improves for a while and > then more or less levels off at a certain point. And at that point > performance is not much worse than it would be with no ring buffer, > but you maintain some protection against cache-trashing. Your > scenario assumes that the system has no concurrent activity which will > suffer as a result of blowing out the cache, but in general that's > probably not true. Well, I noted that I'm not proposing to actually just rip out the ringbuffers. But I also don't think it's just a question of concurrent activity. It's a question of having concurrent activity *and* workloads that are smaller than shared buffers. Given current memory sizes a *lot* of workloads fit entirely in shared buffers - but for vacuum, seqscans (including copy), it's basically impossible to ever take advantage of that memory, unless your workload otherwise forces it into s_b entirely (or you manually load the data into s_b). > It seems to me that it might be time to bite the bullet and add GUCs > for the ring buffer sizes. 
Then, we could make the default sizes big > enough that on normal-ish hardware the performance penalty is not too > severe (like, it's measured as a percentage rather than a multiple), > and we could make a 0 value disable the ring buffer altogether. Yea, it'd be considerably better than today. It'd importantly allow us to more easily benchmark a lot of this. Think it might make sense to have a VACUUM option for disabling the ringbuffer too, especially for cases where VACUUMING is urgent. I think what we ought to do to fix this issue in a bit more principled manner (afterwards) is: 1) For ringbuffer'ed scans, if there are unused buffers, use them, instead of recycling a buffer from the ring. If so, replace the previous member of the ring with the previously unused one. When doing so, just reduce the usagecount by one (unless already zero), so it readily can be replaced. I think we should do so even when the to-be-replaced ringbuffer entry is currently dirty. But even if we couldn't agree on that, it'd already be a significant improvement if we only did this for clean buffers. That'd fix a good chunk of the "my shared buffers is never actually used" type issues. I personally think it's indefensible that we don't do that today. 2) When a valid buffer in the ringbuffer is dirty when about to be replaced, instead of doing the FlushBuffer ourselves (and thus waiting for an XLogFlush in many cases), put into a separate ringbuffer/qeueue that's processed by bgwriter. And have that then invalidate the buffer and put it on the freelist (unless usagecount was bumped since,
Re: [HACKERS] Detrimental performance impact of ringbuffers on performance
Hi, On 2019-05-08 21:35:06 +0500, Andrey Borodin wrote: > > 8 мая 2019 г., в 1:16, Andres Freund написал(а): > > > > We probably can't remove the ringbuffer concept from these places, but I > > think we should allow users to disable them. Forcing bulk-loads, vacuum, > > analytics queries to go to the OS/disk, just because of a heuristic that > > can't be disabled, yielding massive slowdowns, really sucks. > > If we will have scan-resistant shared buffers eviction strategy [0] - > we will not need ring buffers unconditionally. For me that's a fairly big if, fwiw. But it'd be cool. > Are there any other reasons to have these rings? Currently they also limit the amount of dirty data added to the system. I don't think that's a generally good property (e.g. because it'll cause a lot of writes that'll again happen later), but e.g. for initial data loads with COPY FREEZE it's helpful. It slows down the backend(s) causing the work (i.e. doing COPY), rather than other backends (e.g. because they need to evict the buffers, therefore first needing to clean them). Greetings, Andres Freund
Unexpected "shared memory block is still in use"
Just now, while running a parallel check-world on HEAD according to the same script I've been using for quite some time, one of the TAP tests died during initdb: selecting dynamic shared memory implementation ... posix selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting default timezone ... America/New_York creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... 2019-05-08 13:59:19.963 EDT [18351] FATAL: pre-existing shared memory block (key 5440004, ID 1734475802) is still in use 2019-05-08 13:59:19.963 EDT [18351] HINT: Terminate any old server processes associated with data directory "/home/postgres/pgsql/src/test/subscription/tmp_check/t_004_sync_publisher_data/pgdata". child process exited with exit code 1 initdb: removing data directory "/home/postgres/pgsql/src/test/subscription/tmp_check/t_004_sync_publisher_data/pgdata" Bail out! system initdb failed I have never seen this happen before in the TAP tests. I think the odds are very high that this implies something wrong with commit c09850992. My immediate guess after eyeballing that patch quickly is that it was not a good idea to redefine the rules used by bootstrap/standalone backends. In particular, it seems somewhat plausible that the bootstrap process hadn't yet completely died when the standalone backend for the post-bootstrap phase came along and decided there was a conflict (which it never would have before). regards, tom lane
Re: Statistical aggregate functions are not working with PARTIAL aggregation
On 5/8/19 12:41 PM, Greg Stark wrote:
> Don't we have a build farm animal that runs under valgrind that would
> have caught this?

There are two animals running under valgrind: lousyjack and skink.

cheers

andrew
--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Cleanup/remove/update references to OID column
I found what appears to be a dangling reference to old "hidden" OID behavior. Justin >From 1c6712c0ade949648dbc415dfd7ea80312360ef7 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 8 May 2019 13:57:12 -0500 Subject: [PATCH v1] Cleanup/remove/update references to OID column... ..in wake of 578b229718e8f. See also 93507e67c9ca54026019ebec3026de35d30370f9 1464755fc490a9911214817fe83077a3689250ab f6b39171f3d65155b9390c2c69bc5b3469f923a8 Author: Justin Pryzby --- doc/src/sgml/catalogs.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index d544e60..0f9c6f2 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -610,7 +610,7 @@ oid oid - Row identifier (hidden attribute; must be explicitly selected) + Row identifier -- 2.7.4
Re: New EXPLAIN option: ALL
On Tue, May 7, 2019 at 6:25 PM Tom Lane wrote: > Meh --- I don't especially care for non-orthogonal behaviors like that. > If you wanted JSON but *not* all of the additional info, how would you > specify that? (The implementation I had in mind would make VERBOSE OFF > more or less a no-op, so that wouldn't get you there.) +1. Assuming we know which information the user wants on the basis of their choice of output format seems like a bad idea. I mean, suppose we introduced a new option that gathered lots of additional detail but made the query run 3x slower. Would everyone want that enabled all the time any time they chose a non-text format? Probably not. If people want BUFFERS turned on essentially all the time, then let's just flip the default for that, so that EXPLAIN ANALYZE does the equivalent of what EXPLAIN (ANALYZE, BUFFERS) currently does, and make people say EXPLAIN (ANALYZE, BUFFERS OFF) if they don't want all that detail. I think that's more or less what Andres was suggesting upthread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: New EXPLAIN option: ALL
On Tue, May 7, 2019 at 12:31 PM David Fetter wrote: > If you're tuning a query interactively, it's a lot simpler to prepend, > for example, > > EXPLAIN (ALL, FORMAT JSON) > > to it than to prepend something along the lines of > > EXPLAIN(ANALYZE, VERBOSE, COSTS, BUFFERS, SETTINGS, TIMING, SUMMARY, > PARTRIDGE_IN_A_PEAR_TREE, FORMAT JSON) > > to it. This is something of an exaggeration of what could ever be necessary, because COSTS and TIMING default to TRUE and SUMMARY defaults to TRUE when ANALYZE is specified, and the PARTRIDGE_IN_A_PEAR_TREE option seems not to have made it into the tree this cycle. But you could need EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS, FORMAT JSON), which is not quite so long, but admittedly still somewhat long. Flipping some of the defaults seems like it might be the way to go. I think turning SETTINGS and BUFFERS on by default would be pretty sensible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
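As a concrete illustration of the verbosity being discussed (the query and table name are made up), the fully spelled-out form today is:

    EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS, FORMAT JSON)
        SELECT count(*) FROM pgbench_accounts;

With BUFFERS and SETTINGS on by default, as proposed, the common case would shrink to EXPLAIN (ANALYZE, FORMAT JSON), and opting out would use the usual boolean spelling:

    EXPLAIN (ANALYZE, BUFFERS OFF)
        SELECT count(*) FROM pgbench_accounts;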
pg12 release notes
I noticed you added release notes at bdf595adbca195fa54a909c74a5233ebc30641a1, thanks for writing them. I reviewed notes; find proposed changes attached+included. I think these should also be mentioned? f7cb284 Add plan_cache_mode setting a6da004 Add index_get_partition convenience function 387a5cf Add pg_dump --on-conflict-do-nothing option. 17f206f Set pg_class.relhassubclass for partitioned indexes diff --git a/doc/src/sgml/release-12.sgml b/doc/src/sgml/release-12.sgml index 88bdcbd..ab4d1b3 100644 --- a/doc/src/sgml/release-12.sgml +++ b/doc/src/sgml/release-12.sgml @@ -61,8 +61,8 @@ Remove the special behavior of OID columns (Andres Freund, John Naylor) Previously, a normally-invisible OID column could be specified during table creation using WITH OIDS; that ability has been removed. Columns can still be explicitly -specified as type OID. pg_dump and pg_upgrade operations on databases using WITH OIDS will need adjustment. Many system tables now have an 'oid' column that will be -expanded with SELECT * by default. +specified as type OID. pg_dump and pg_upgrade operations on databases using WITH OIDS will need adjustment. The 'oid' column of many system tables will be +shown by default with SELECT *. @@ -115,7 +115,7 @@ Do not allow multiple different recovery_target* specifications (Peter Eisentrau -Previously multiple different recovery_target* variables could be specified, and last one specified was honored. Now, only one can be specified, though the same one can +Previously multiple different recovery_target* variables could be specified, and the last one specified was honored. Now, only one can be specified, though the same one can be specified multiple times and the last specification is honored. @@ -405,7 +405,7 @@ Author: Robert Haas --> -Allow ATTACH PARTITION to be performed with reduced locking requirements (Robert Haas) +ATTACH PARTITION is performed with lower locking requirement (Robert Haas) @@ -617,7 +617,7 @@ Have new btree indexes sort duplicate index entries in heap-storage order (Peter -Btree indexes pg_upgraded from previous releases will not have this ordering. This does slightly reduce the maximum length of indexed values. +Btree indexes pg_upgraded from previous releases will not have this ordering. This slightly reduces the maximum permitted length of indexed values. @@ -676,7 +676,7 @@ Allow CREATE STATISTICS to create most-common-value statistics for multiple colu -This improve optimization for columns with non-uniform distributions that often appear in WHERE clauses. +This improves query plans for columns with non-uniform distributions that often appear in WHERE clauses. @@ -954,21 +954,6 @@ This dramatically speeds up processing of floating-point values. Users who wish - - -Avoid creation of the free space map files for small table (John Naylor, Amit Kapila) - - - -Such files are not useful. - - - - - -Allow more comparisons with information_schema text columns use indexes (Tom Lane) +Allow more comparisons with information_schema text columns to use indexes (Tom Lane) @@ -1310,7 +1295,7 @@ Author: Thomas Munro --> -Allow discovery of the LDAP server using DNS (Thomas Munro) +Allow discovery of the LDAP server using DNS SRV records (Thomas Munro) @@ -1446,7 +1431,7 @@ Add wal_recycle and wal_init_zero server variables to avoid WAL file recycling ( -This can be beneficial on copy-on-write file systems like ZFS. +This can be beneficial on copy-on-write filesystems like ZFS. 
@@ -1502,7 +1487,7 @@ Add server variable to control the type of shared memory to use (Andres Freund) -The variable is shared_memory_type. It purpose is to allow selection of System V shared memory, if desired. +The variable is shared_memory_type. Its purpose is to allow selection of System V shared memory, if desired. >From a2d7fccc8cd8035898a4459e4b7176ef1168f713 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 8 May 2019 15:22:36 -0500 Subject: [PATCH v1] review pg12 release notes --- doc/src/sgml/release-12.sgml | 37 +++-- 1 file changed, 11 insertions(+), 26 deletions(-) diff --git a/doc/src/sgml/release-12.sgml b/doc/src/sgml/release-12.sgml index 88bdcbd..ab4d1b3 100644 --- a/doc/src/sgml/release-12.sgml +++ b/doc/src/sgml/release-12.sgml @@ -61,8 +61,8 @@ Remove the special behavior of OID columns (Andres Freund, John Naylor) Previously, a normally-invisible OID column could be specified during table creation using WITH OIDS; that ability has been removed. Columns can still be explicitly -specified as type OID. pg_dump and pg_upgrade operations on databases using WITH OIDS will need adjustment. Many system tables now have an 'oid' column that will be -expanded with SELECT * by default. +specified as type OID. pg_dump and pg_upgrade operations on databases using WITH OIDS will need adjustment. The
Re: pg12 release notes
On 2019-May-08, Justin Pryzby wrote: > I noticed you added release notes at bdf595adbca195fa54a909c74a5233ebc30641a1, > thanks for writing them. > > I reviewed notes; find proposed changes attached+included. > > I think these should also be mentioned? > > a6da004 Add index_get_partition convenience function I don't disagree with the other three you suggest, but this one is a C function and I think it's below the level of usefulness that we publish in relnotes. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Patch to document base64 encoding
Er, ping. Nobody has reviewed the latest patchs. They still apply to master... I am re-attaching the patches. See descriptions below. On Mon, 11 Mar 2019 15:32:14 -0500 "Karl O. Pinc" wrote: > On Sun, 10 Mar 2019 08:15:35 +0100 (CET) > Fabien COELHO What's causing problems here is that the encode and decode > functions are listed in both the string functions section > and the binary functions section. A related but not-relevant > problem is that there are functions listed in the string > function section which take binary input. > > I asked about this on IRC and the brief reply was > unflattering to the existing documentation. > > So I'm going to fix this also. 3 patches attached: > > doc_base64_part1_v9.patch > > This moves functions taking bytea and other non-string > input into the binary string section, and vice versa. > Eliminates duplicate encode() and decode() documentation. > > Affects: convert(bytea, name, name) >convert_from(bytea, name) >encode(bytea, text) >length(bytea, name) >quote_nullable(anytype) >to_hex(int or bigint) >decode(text, text) > > Only moves, eliminates duplicates, and adjusts indentation. > > > doc_base64_part2_v9.patch > > Cleanup wording after moving functions between sections. > > > doc_base64_part3_v9.patch > > Documents base64, hex, and escape encode() and decode() > formats. > > > >> "The string and the binary encode and decode functions..." > > >> sentence looks strange to me, especially with the English > > >> article that I do not really master, so maybe it is ok. I'd have > > >> written something more straightforward, eg: "Functions encode > > >> and decode support the following encodings:", > > > > > > It is an atypical construction because I want to draw attention > > > that this is documentation not only for the encode() and decode() > > > in section 9.4. String Functions and Operators but also for the > > > encode() and decode in section 9.5. Binary String Functions and > > > Operators. Although I can't think of a better approach it makes me > > > uncomfortable that documentation written in one section applies > > > equally to functions in a different section. > > > > People coming from the binary doc would have no reason to look at > > the string paragraph anyway. > > > > > Do you think it would be useful to hyperlink the word "binary" > > > to section 9.5? > > > > Hmmm... I think that the link is needed in the other direction. > > I'm not sure what you mean here or if it's still relevant. > > > I'd suggest (1) to use a simpler and direct sentence in the string > > section, (2) to simplify/shorten the in cell description in the > > binary section, and (3) to add an hyperlink from the binary section > > which would point to the expanded explanation in the string section. > > > > > The idiomatic phrasing would be "Both the string and the binary > > > encode and decode functions..." but the word "both" adds > > > no information. Shorter is better. > > > > Possibly, although "Both" would insist on the fact that it applies > > to the two variants, which was your intention. > > I think this is no longer relevant. Although I'm not sure what > you mean by 3. The format names already hyperlink back to the > string docs. > > > >> and also I'd use a direct "Function > > >> <...>decode ..." rather than "The > > >> decode function ..." (twice). > > > > > > The straightforward English would be "Decode accepts...". The > > > problem is that this begins the sentence with the name of a > > > function. 
This does not work very well when the function name is > > > all lower case, and can have other problems where clarity is lost > > > depending on documentation output formatting. > > > > Yep. > > > > > I don't see a better approach. > > > > I suggested "Function <>decode ...", which is the kind of thing > > we do in academic writing to improve precision, because I thought it > > could be better:-) > > "Function <>decode ..." just does not work in English. > > > >> Maybe I'd use the exact same grammatical structure for all 3 > > >> cases, starting with "The <>whatever encoding converts bla > > >> bla bla" instead of varying the sentences. > > > > > > Agreed. Good idea. The first paragraph of each term has to > > > do with encoding and the second with decoding. > > > > > > > Uniformity in starting the second paragraphs helps make > > > this clear, even though the first paragraphs are not uniform. > > > With this I am not concerned that the first paragraphs > > > do not have a common phrasing that's very explicit about > > > being about encoding. > > > > > > Adjusted. > > > > Cannot see it fully in the v8 patch: > > > > - The base64 encoding is > > - hex represents > > - escape converts > > I did only the decode paras. I guess no reason not to make > the first paras uniform as well. Done
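Since the thread is about documenting these formats, a few concrete calls may help readers following along (the functions already exist; outputs are shown as printed with the default bytea_output = 'hex'):

    SELECT encode('\x1234'::bytea, 'base64');   -- EjQ=
    SELECT decode('EjQ=', 'base64');            -- \x1234
    SELECT encode('\x1234'::bytea, 'hex');      -- 1234
    SELECT decode('1234', 'hex');               -- \x1234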
Re: vacuumdb and new VACUUM options
On Wed, May 8, 2019 at 14:19, Fujii Masao wrote: > > The question is; we should support vacuumdb option for (1), i.e.,, > something like --index-cleanup option is added? > Or for (2), i.e., something like --disable-index-cleanup option is added > as your patch does? Or for both? > --index-cleanup=BOOL -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training
Re: New EXPLAIN option: ALL
On 07/05/2019 09:30, David Fetter wrote: > Folks, > > It can get a little tedious turning on (or off) all the boolean > options to EXPLAIN, so please find attached a shortcut. I would rather have a set of gucs such as default_explain_buffers, default_explain_summary, and default_explain_format. Of course if you default BUFFERS to on(*) and don't do ANALYZE, that should not result in an error. (*) Defaulting BUFFERS to on is something I want regardless of anything else we do. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL: Expertise, Training and Support
Re: Pluggable Storage - Andres's take
Hi, On 2019-04-29 16:17:41 -0700, Ashwin Agrawal wrote: > On Thu, Apr 25, 2019 at 3:43 PM Andres Freund wrote: > > Hm. I think some of those changes would be a bit bigger than I initially > > though. Attached is a more minimal fix that'd route > > RelationGetNumberOfBlocksForFork() through tableam if necessary. I > > think it's definitely the right answer for 1), probably the pragmatic > > answer to 2), but certainly not for 3). > > > > I've for now made the AM return the size in bytes, and then convert that > > into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers > > are going to continue to want it internally as pages (otherwise there's > > going to be way too much churn, without a benefit I can see). So I > > think that's OK. > > I will provide my inputs, Heikki please correct me or add your inputs. > > I am not sure how much gain this practically provides, if rest of the > system continues to use the value returned in-terms of blocks. I > understand things being block based (and not just really block based > but all the blocks of relation are storing data and full tuple) is > engraved in the system. So, breaking out of it is yes much larger > change and not just limited to table AM API. I don't think it's that ingrained in all that many parts of the system. Outside of the places I listed upthread, and the one index case that stashes extra info, which places are that "block based"? > I feel most of the issues discussed here should be faced by zheap as > well, as not all blocks/pages contain data like TPD pages should be > excluded from sampling and TID scans, etc... It's not a problem so far, and zheap works on tableam. You can just skip such blocks during sampling / analyze, and return nothing for tidscans. > > > 2) commands/analyze.c, computing pg_class.relpages > > > > > >This should imo be moved to the tableam callback. It's currently done > > >a bit weirdly imo, with fdws computing relpages the callback, but > > >then also returning the acquirefunc. Seems like it should entirely be > > >computed as part of calling acquirefunc. > > > > Here I'm not sure routing RelationGetNumberOfBlocksForFork() through > > tableam wouldn't be the right minimal approach too. It has the > > disadvantage of implying certain values for the > > RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative > > would be to return the desire sampling range in > > table_beginscan_analyze() - but that'd require some duplication because > > currently that just uses the generic scan_begin() callback. > > Yes, just routing relation size via AM layer and using its returned > value in terms of blocks still and performing sampling based on blocks > based on it, doesn't feel resolves the issue. Maybe need to delegate > sampling completely to AM layer. Code duplication can be avoided by > similar AMs (heap and zheap) possible using some common utility > functions to achieve intended result. I don't know what this is actually proposing. > > I suspect - as previously mentioned- that we're going to have to extend > > statistics collection beyond the current approach at some point, but I > > don't think that's now. At least to me it's not clear how to best > > represent the stats, and how to best use them, if the underlying storage > > is fundamentally not block best. Nor how we'd avoid code duplication... > > Yes, will have to give more thoughts into this. 
> > > > > > 3) nodeTidscan, skipping over too large tids > > >I think this should just be moved into the AMs, there's no need to > > >have this in nodeTidscan.c > > > > I think here it's *not* actually correct at all to use the relation > > size. It's currently doing: > > > > /* > > * We silently discard any TIDs that are out of range at the time > > of scan > > * start. (Since we hold at least AccessShareLock on the table, it > > won't > > * be possible for someone to truncate away the blocks we intend to > > * visit.) > > */ > > nblocks = > > RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation); > > > > which is fine (except for a certain abstraction leakage) for an AM like > > heap or zheap, but I suspect strongly that that's not ok for Ashwin & > > Heikki's approach where tid isn't tied to physical representation. > > Agree, its not nice to have that optimization being performed based on > number of block in generic layer. I feel its not efficient either for > zheap too due to TPD pages as mentioned above, as number of blocks > returned will be higher compared to actually data blocks. I don't think there's a problem for zheap. The blocks are just interspersed. Having pondered this a lot more, I think this is the way to go for 12. Then we can improve this for v13, to be nice. > > The proper fix seems to be to introduce a new scan variant > > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take > > a scan as a paramete
Re: Pluggable Storage - Andres's take
Hi, On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote: > On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal wrote: > > Also wish to point out, while working on Zedstore, we realized that > > TupleDesc from Relation object can be trusted at AM layer for > > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()), > > catalog is updated first and hence the relation object passed to AM > > layer reflects new TupleDesc. For heapam its fine as it doesn't use > > the TupleDesc today during scans in AM layer for scan_getnextslot(). > > Only TupleDesc which can trusted and matches the on-disk layout of the > > tuple for scans hence is from TupleTableSlot. Which is little > > unfortunate as TupleTableSlot is only available in scan_getnextslot(), > > and not in scan_begin(). Means if AM wishes to do some initialization > > based on TupleDesc for scans can't be done in scan_begin() and forced > > to delay till has access to TupleTableSlot. We should at least add > > comment for scan_begin() to strongly clarify not to trust Relation > > object TupleDesc. Or maybe other alternative would be have separate > > API for rewrite case. > > Just to correct my typo, I wish to say, TupleDesc from Relation object can't > be trusted at AM layer for scan_begin() API. > > Andres, any thoughts on above. I see you had proposed "change the > table_beginscan* API so it > provides a slot" in [1], but seems received no response/comments that time. > [1] > https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de I don't think passing a slot at beginscan time is a good idea. There's several places that want to use different slots for the same scan, and we probably want to increase that over time (e.g. for batching), not decrease it. What kind of initialization do you want to do based on the tuple desc at beginscan time? Greetings, Andres Freund
Re: Inconsistency between table am callback and table function names
Hi, On 2019-05-08 00:32:22 -0700, Ashwin Agrawal wrote: > The general theme for table function names seem to be > "table_". For example table_scan_getnextslot() and its > corresponding callback scan_getnextslot(). Most of the table functions and > callbacks follow mentioned convention except following ones > > table_beginscan > table_endscan > table_rescan > table_fetch_row_version > table_get_latest_tid > table_insert > table_insert_speculative > table_complete_speculative > table_delete > table_update > table_lock_tuple > > the corresponding callback names for them are > > scan_begin > scan_end > scan_rescan The mismatch here is just due of backward compat with the existing function names. > tuple_fetch_row_version > tuple_get_latest_tid Hm, I'd not object to adding a tuple_ to the wrapper. > tuple_insert > tuple_insert_speculative > tuple_delete > tuple_update > tuple_lock That again is to keep the naming similar to the existing functions. > Also, some of these table function names read little odd > > table_relation_set_new_filenode > table_relation_nontransactional_truncate > table_relation_copy_data > table_relation_copy_for_cluster > table_relation_vacuum > table_relation_estimate_size > > Can we drop relation word from callback names and as a result from these > function names as well? Just have callback names as set_new_filenode, > copy_data, estimate_size? I'm strongly against that. These all work on a full relation size, rather than on individual tuples, and that seems worth pointing out. > Also, a question about comments. Currently, redundant comments are written > above callback functions as also above table functions. They differ > sometimes a little bit on descriptions but majority content still being the > same. Should we have only one place finalized to have the comments to keep > them in sync and also know which one to refer to? Note that the non-differing comments usually just refer to the other place. And there's legitimate differences in most of the ones that are both at the callback and the external functions - since the audience of both are difference, that IMO makes sense. > Plus, file name amapi.h now seems very broad, if possible should be renamed > to indexamapi.h or indexam.h to follow tableam.h. No idea what's our policy > around header file renames. We probably should rename it, but not in 12... Greetings, Andres Freund
Problems with pg_upgrade and extensions referencing catalog tables/views
pgTap has a view that references pg_proc; to support introspection of functions and aggregates. That view references proisagg in versions < 11, and prokind in 11+. pgtap's make process understands how to handle this; modifying the creation scripts as necessary. It actually has to do this for several functions as well. The problem is that pg_dump --binary-upgrade intentionally does not simply issue a `CREATE EXTENSION` command the way a normal dump does, so that it can control the OIDs that are assigned to objects[1]. That means that attempting to pg_upgrade a database with pgtap installed to version 11+ fails trying to create the view that references pg_proc.proisagg[2]. For pgtap, we should be able to work around this by removing the offending column from the view and embedding the knowledge in a function. This would be more difficult in other types of extensions though, especially any that are attempting to provide more user-friendly views of catalog tables. I don’t recall why pg_upgrade wants to control OIDs… don’t we re-create all catalog entries for user objects from scratch? 1: https://www.postgresql.org/message-id/AANLkTimm1c64=xkdpz5ji7q-rh69zd3cmewmrpkh0...@mail.gmail.com 2: https://github.com/theory/pgtap/issues/201
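To make the underlying problem concrete, this is the general shape of a version-dependent catalog view (a sketch with a placeholder view name, not pgTap's actual definition):

    -- On PostgreSQL 10 and earlier, pg_proc has a boolean proisagg column:
    CREATE VIEW funcs AS
        SELECT p.oid, p.proname, p.proisagg FROM pg_proc p;

    -- On PostgreSQL 11 and later, proisagg was replaced by prokind:
    CREATE VIEW funcs AS
        SELECT p.oid, p.proname, (p.prokind = 'a') AS proisagg FROM pg_proc p;

A dump taken from the older definition cannot be restored as-is on the newer server, which is what the --binary-upgrade dump runs into when it recreates the extension's objects individually instead of issuing CREATE EXTENSION.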
Re: New EXPLAIN option: ALL
> On May 8, 2019, at 4:22 PM, Vik Fearing wrote: > > On 07/05/2019 09:30, David Fetter wrote: >> Folks, >> >> It can get a little tedious turning on (or off) all the boolean >> options to EXPLAIN, so please find attached a shortcut. > > I would rather have a set of gucs such as default_explain_buffers, > default_explain_summary, and default_explain_format. > > Of course if you default BUFFERS to on(*) and don't do ANALYZE, that > should not result in an error. > > (*) Defaulting BUFFERS to on is something I want regardless of anything > else we do. I think this, plus Tom’s suggestion of changing what VERBOSE does, is the best way to handle this. Especially since VERBOSE is IMHO pretty useless... I’m +1 on trying to move away from ANALYZE as well, though I think it’s mostly orthogonal...
Re: _bt_split(), and the risk of OOM before its critical section
On Tue, May 7, 2019 at 6:15 PM Peter Geoghegan wrote: > I suppose I'm biased, but I prefer the new approach anyway. Adding the > left high key first, and then the right high key seems simpler and > more logical. It emphasizes the similarities and differences between > leftpage and rightpage. I came up with a better way of doing it in the attached revision. Now, _bt_split() calls _bt_findsplitloc() directly. This makes it possible to significantly simplify the signature of _bt_split(). It makes perfect sense for _bt_split() to call _bt_findsplitloc() directly, since _bt_findsplitloc() is already aware of almost every _bt_split() implementation detail, whereas those same details are not of interest anywhere else. _bt_findsplitloc() also knows all about suffix truncation. It's also nice that the actual _bt_truncate() call is closely tied to the _bt_findsplitloc() call. -- Peter Geoghegan v2-0001-Don-t-leave-behind-junk-nbtree-pages-during-split.patch Description: Binary data
Re: Problems with pg_upgrade and extensions referencing catalog tables/views
"Nasby, Jim" writes: > The problem is that pg_dump --binary-upgrade intentionally does not > simply issue a `CREATE EXTENSION` command the way a normal dump does, so > that it can control the OIDs that are assigned to objects[1]. That's not the only reason. The original concerns were about not breaking the extension, in case the destination server had a different version of the extension available. CREATE EXTENSION doesn't normally guarantee that you get an exactly compatible extension version, which is a good thing for regular pg_dump and restore but a bad thing for binary upgrade. I'm not really sure how to improve the situation you describe, but "issue CREATE EXTENSION and pray" doesn't sound like a solution. regards, tom lane
Re: Inconsistency between table am callback and table function names
On Wed, May 8, 2019 at 2:51 PM Andres Freund wrote: > > Hi, > > On 2019-05-08 00:32:22 -0700, Ashwin Agrawal wrote: > > The general theme for table function names seem to be > > "table_". For example table_scan_getnextslot() and its > > corresponding callback scan_getnextslot(). Most of the table functions and > > callbacks follow mentioned convention except following ones > > > > table_beginscan > > table_endscan > > table_rescan > > table_fetch_row_version > > table_get_latest_tid > > table_insert > > table_insert_speculative > > table_complete_speculative > > table_delete > > table_update > > table_lock_tuple > > > > the corresponding callback names for them are > > > > scan_begin > > scan_end > > scan_rescan > > The mismatch here is just due of backward compat with the existing > function names. I am missing something here, would like to know more. table_ seem all new fresh naming. Hence IMO having consistency with surrounding and related code carries more weight as I don't know backward compat serving what purpose. Heap function names can continue to call with same old names for backward compat if required. > > Also, a question about comments. Currently, redundant comments are written > > above callback functions as also above table functions. They differ > > sometimes a little bit on descriptions but majority content still being the > > same. Should we have only one place finalized to have the comments to keep > > them in sync and also know which one to refer to? > > Note that the non-differing comments usually just refer to the other > place. And there's legitimate differences in most of the ones that are > both at the callback and the external functions - since the audience of > both are difference, that IMO makes sense. > Not having consistency is the main aspect I wish to bring to attention. Like for some callback functions the comment is /* see table_insert() for reference about parameters */ void(*tuple_insert) (Relation rel, TupleTableSlot *slot, CommandId cid, int options, struct BulkInsertStateData *bistate); /* see table_insert_speculative() for reference about parameters */ void(*tuple_insert_speculative) (Relation rel, TupleTableSlot *slot, CommandId cid, int options, struct BulkInsertStateData *bistate, uint32 specToken); Whereas for some other callbacks the parameter explanation exist in both the places. Seems we should be consistent. I feel in long run becomes pain to keep them in sync as comments evolve. Like for example /* * Estimate the size of shared memory needed for a parallel scan of this * relation. The snapshot does not need to be accounted for. */ Size(*parallelscan_estimate) (Relation rel); parallescan_estimate is not having snapshot argument passed to it, but table_parallescan_estimate does. So, this way chances are high they going out of sync and missing to modify in both the places. Agree though on audience being different for both. So, seems going with the refer XXX for parameters seems fine approach for all the callbacks and then only specific things to flag out for the AM layer to be aware can live above the callback function. > > Plus, file name amapi.h now seems very broad, if possible should be renamed > > to indexamapi.h or indexam.h to follow tableam.h. No idea what's our policy > > around header file renames. > > We probably should rename it, but not in 12... Okay good to know.
RE: Patch: doc for pg_logical_emit_message()
On Thu. May. 9, 2019 at 01:48 AM Masao, Fujii wrote: > Thanks! Pushed. Thank you. Regards Ryo Matsumura
Re: vacuumdb and new VACUUM options
On Wed, May 08, 2019 at 06:21:09PM -0300, Euler Taveira wrote: > On Wed, May 8, 2019 at 14:19, Fujii Masao > wrote: >> The question is; we should support vacuumdb option for (1), i.e.,, >> something like --index-cleanup option is added? >> Or for (2), i.e., something like --disable-index-cleanup option is added >> as your patch does? Or for both? > > --index-cleanup=BOOL I agree with Euler's suggestion to have a 1-1 mapping between the option of vacuumdb and the VACUUM parameter, because that's more intuitive: - --index-cleanup=false => VACUUM (INDEX_CLEANUP=false) - --index-cleanup=true => VACUUM (INDEX_CLEANUP=true) - no --index-cleanup means to rely on the reloption. -- Michael signature.asc Description: PGP signature
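Spelled out as SQL, that mapping corresponds to the following commands (the vacuumdb switch itself is still only a proposal here; "mytab" is a placeholder table name):

    VACUUM (INDEX_CLEANUP FALSE) mytab;   -- vacuumdb --index-cleanup=false
    VACUUM (INDEX_CLEANUP TRUE) mytab;    -- vacuumdb --index-cleanup=true
    VACUUM mytab;                         -- no switch: the vacuum_index_cleanup reloption decides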
Re: Wrong return code in vacuumdb when multiple jobs are used
On Sat, May 04, 2019 at 11:48:53AM -0400, Tom Lane wrote: > +1, waiting till after the minor releases are tagged seems wisest. > We can still push it before 12beta1, so it will get tested in the beta > period. The new minor releases have been tagged, so committed. -- Michael signature.asc Description: PGP signature
Re: Statistical aggregate functions are not working with PARTIAL aggregation
Hello. There is an unfortunate story on this issue. At Wed, 8 May 2019 14:56:25 -0400, Andrew Dunstan wrote in <7969b496-096a-bf9b-2a03-4706baa4c...@2ndquadrant.com> > > On 5/8/19 12:41 PM, Greg Stark wrote: > > Don't we have a build farm animal that runs under valgrind that would > > have caught this? > > > > > > There are two animals running under valgrind: lousyjack and skink. Valgrind doesn't detect the overruning read since the block doesn't has 'MEMNOACCESS' region, since the requested size is just 64 bytes. Thus the attached patch let valgrind detect the overrun. ==00:00:00:22.959 20254== VALGRINDERROR-BEGIN ==00:00:00:22.959 20254== Conditional jump or move depends on uninitialised value(s) ==00:00:00:22.959 20254==at 0x88A838: ExecInterpExpr (execExprInterp.c:1553) ==00:00:00:22.959 20254==by 0x88AFD5: ExecInterpExprStillValid (execExprInterp.c:1769) ==00:00:00:22.959 20254==by 0x8C3503: ExecEvalExprSwitchContext (executor.h:307) ==00:00:00:22.959 20254==by 0x8C4653: advance_aggregates (nodeAgg.c:679) regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index d01fc4f52e..7c6eab6d94 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -2935,7 +2935,7 @@ build_pertrans_for_aggref(AggStatePerTrans pertrans, fmgr_info_set_expr((Node *) combinefnexpr, &pertrans->transfn); pertrans->transfn_fcinfo = - (FunctionCallInfo) palloc(SizeForFunctionCallInfo(2)); + (FunctionCallInfo) palloc(SizeForFunctionCallInfo(2) + 1); InitFunctionCallInfoData(*pertrans->transfn_fcinfo, &pertrans->transfn, 2,
Re: Fuzzy thinking in is_publishable_class
I wrote: > is_publishable_class has a test "relid >= FirstNormalObjectId", > which I think we should drop, for two reasons: > ... > So what is the motivation for this test? If there's an important > reason for it, we need to find a less fragile way to express it. I tried removing the FirstNormalObjectId check, and found that the reason for it seems to be "the subscription/t/004_sync.pl test falls over without it". That's because that test supposes that the *only* entry in pg_subscription_rel will be for the test table that it creates. Without the FirstNormalObjectId check, the information_schema relations also show up in pg_subscription_rel, confusing the script's simplistic status check. I'm of two minds what to do about that. One approach is to just define a "FOR ALL TABLES" publication as including the information_schema tables, in which case 004_sync.pl is wrong and we should fix it by adding a suitable WHERE restriction to its pg_subscription_rel check. However, possibly that would break some applications that are likewise assuming that no built-in tables appear in pg_subscription_rel. But, if what we want is the definition that "information_schema is excluded from publishable tables", I'm not satisfied with this implementation of that rule. Dropping/recreating information_schema would cause the behavior to change. We could, at the cost of an additional syscache lookup, check the name of the schema that a potentially publishable table belongs to and exclude information_schema by name. I don't have much idea about how performance-critical is_publishable_class is, so I don't know how acceptable that seems. regards, tom lane
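For illustration, a WHERE restriction of the kind mentioned above could look something like this (a sketch only, not the actual 004_sync.pl query):

    SELECT count(*)
    FROM pg_subscription_rel r
         JOIN pg_class c ON c.oid = r.srrelid
         JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname <> 'information_schema';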
Re: Inconsistent error message wording for REINDEX CONCURRENTLY
Michael Paquier writes: > No problem to do that. I'll brush up all that once you commit the > first piece you have come up with, and reuse the new API of catalog.c > you are introducing based on the table OID. Pushed my stuff, have at it. regards, tom lane
Re: Inconsistent error message wording for REINDEX CONCURRENTLY
On Wed, May 08, 2019 at 11:28:35PM -0400, Tom Lane wrote: > Michael Paquier writes: >> No problem to do that. I'll brush up all that once you commit the >> first piece you have come up with, and reuse the new API of catalog.c >> you are introducing based on the table OID. > > Pushed my stuff, have at it. Thanks. Attached is what I get to after scanning the error messages in indexcmds.c and index.c. Perhaps you have more comments about it? -- Michael diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index dab06e20a8..7e7c03ef12 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -2743,6 +2743,12 @@ ReindexRelationConcurrently(Oid relationOid, int options) MemoryContextSwitchTo(oldcontext); +/* A system catalog cannot be reindexed concurrently */ +if (IsCatalogRelationOid(relationOid)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot reindex a system catalog concurrently"))); + /* Open relation to get its indexes */ heapRelation = table_open(relationOid, ShareUpdateExclusiveLock); @@ -2756,13 +2762,13 @@ ReindexRelationConcurrently(Oid relationOid, int options) if (!indexRelation->rd_index->indisvalid) ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot reindex concurrently invalid index \"%s.%s\", skipping", + errmsg("cannot reindex invalid index \"%s.%s\" concurrently, skipping", get_namespace_name(get_rel_namespace(cellOid)), get_rel_name(cellOid; else if (indexRelation->rd_index->indisexclusion) ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot reindex concurrently exclusion constraint index \"%s.%s\", skipping", + errmsg("cannot reindex exclusion constraint index \"%s.%s\" concurrently, skipping", get_namespace_name(get_rel_namespace(cellOid)), get_rel_name(cellOid; else @@ -2802,7 +2808,7 @@ ReindexRelationConcurrently(Oid relationOid, int options) if (!indexRelation->rd_index->indisvalid) ereport(WARNING, (errcode(ERRCODE_INDEX_CORRUPTED), - errmsg("cannot reindex concurrently invalid index \"%s.%s\", skipping", + errmsg("cannot reindex invalid index \"%s.%s\" concurrently, skipping", get_namespace_name(get_rel_namespace(cellOid)), get_rel_name(cellOid; else @@ -2831,17 +2837,11 @@ ReindexRelationConcurrently(Oid relationOid, int options) { Oid heapId = IndexGetRelation(relationOid, false); -/* A shared relation cannot be reindexed concurrently */ -if (IsSharedRelation(heapId)) - ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("concurrent reindex is not supported for shared relations"))); - /* A system catalog cannot be reindexed concurrently */ if (IsCatalogRelationOid(heapId)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("concurrent reindex is not supported for catalog relations"))); + errmsg("cannot reindex a system catalog concurrently"))); /* Save the list of relation OIDs in private context */ oldcontext = MemoryContextSwitchTo(private_context); @@ -2869,7 +2869,7 @@ ReindexRelationConcurrently(Oid relationOid, int options) /* Return error if type of relation is not supported */ ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), - errmsg("cannot reindex concurrently this type of relation"))); + errmsg("cannot reindex this type of relation concurrently"))); break; } diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out index 326dc44177..c8bc6be061 100644 --- a/src/test/regress/expected/create_index.out +++ b/src/test/regress/expected/create_index.out @@ 
-1967,7 +1967,7 @@ INSERT INTO concur_reindex_tab3 VALUES (3, '[1,2]'); REINDEX INDEX CONCURRENTLY concur_reindex_tab3_c2_excl; -- error ERROR: concurrent index creation for exclusion constraints is not supported REINDEX TABLE CONCURRENTLY concur_reindex_tab3; -- succeeds with warning -WARNING: cannot reindex concurrently exclusion constraint index "public.concur_reindex_tab3_c2_excl", skipping +WARNING: cannot reindex exclusion constraint index "public.concur_reindex_tab3_c2_excl" concurrently, skipping INSERT INTO concur_reindex_tab3 VALUES (4, '[2,4]'); ERROR: conflicting key value violates exclusion constraint "concur_reindex_tab3_c2_excl" DETAIL: Key (c2)=([2,5)) conflicts with existing key (c2)=([1,3)). @@ -2092,10 +2092,15 @@ BEGIN; REINDEX TABLE CONCURRENTLY concur_reindex_tab; ERROR: REINDEX CONCURRENTLY cannot run inside a transaction block COMMIT; -REINDEX TABLE CONCURRENTLY pg_database; -- no shared relation -ERROR: concurrent index creation on system catalog tables is not supported -REINDEX TABLE CONCURRENTLY pg
Re: Implicit timezone conversion replicating from timestamp to timestamptz?
On Friday, 25 January 2019 04:57:15 UTC+8, Jeremy Finzel wrote: > > We are working to migrate several large tables from the timestamp to the > timestamptz data type by using logical replication (so as to avoid long > downtime for type conversions). We are using pglogical but curious if what > I share below applies to native logical replication as well. > > Both source and destination dbs are at localtime, which is > 'America/Chicago' time zone. > > The source system has a timestamp stored "at time zone UTC", like this for > 6:00pm Chicago time: > 2019-01-24 20:00:00.00 > > I was *very surprised* to find that replicating above timestamp to > timestamptz actually does so correctly, showing this value in my psql > client on the subscriber: > 2019-01-24 14:00:00.00-06 > > How does it know/why does it assume it knows that the time zone of the > timestamp data type is UTC on the provider given that my clusters are at > America/Chicago? I would have actually expected an incorrect conversion of > the data unless I set the timezone to UTC on the way in on the subscriber > via a trigger. > > That is, I was expecting to see this: > 2019-01-24 20:00:00.00-06 > > Which is obviously wrong. So why does it do this and is there some > assumption being made somewhere in the code base that a timestamp is > actually saved "at time zone UTC"? > > pglogical is replicating the timestamp text, which is converted on both output and input.
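The "converted on input" part is easy to see with plain casts: the same timestamp text maps to different instants depending on the session's TimeZone (illustrative values):

    SET timezone = 'UTC';
    SELECT '2019-01-24 20:00:00'::timestamptz;   -- 2019-01-24 20:00:00+00
    SET timezone = 'America/Chicago';
    SELECT '2019-01-24 20:00:00'::timestamptz;   -- 2019-01-24 20:00:00-06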
Re: [HACKERS] proposal: schema variables
Hi rebased patch Regards Pavel schema-variables-20190509.patch.gz Description: application/gzip
Re: Patch to document base64 encoding
Er, ping. Nobody has reviewed the latest patches. Next CF is in July, two months away. You might consider reviewing other people's patches; that is expected to make the overall process work. There are several documentation or comment patches in the queue. -- Fabien.
integrate Postgres Users Authentication with our own LDAP Server
Hi all, We would need to integrate Postgres Users Authentication with our own LDAP Server. Basically as of now we are able to log in to the Postgres DB with a user/password credential. [screenshot: roles "pg_signal_backend" and "postgres"] These user objects are part of the Postgres DB server. Now we want these users to be authenticated by the LDAP server. We would want the authentication to be done with LDAP, so basically the user credentials should be stored in the LDAP server. Can you mention the prescribed steps in Postgres needed for this integration with the LDAP Server? Regards Tarkeshwar
Re: range_agg
On Mon, May 6, 2019 at 4:21 PM Paul Jungwirth wrote: > I need to write some docs and do > some cleanup and I'll have a CF entry. Here is an initial patch. I'd love to have some feedback! :-) One challenge was handling polymorphism, since I want to have this: anyrange[] range_agg(anyrange, bool, bool) But there is no anyrange[]. I asked about this back when I wrote my extension too: https://www.postgresql.org/message-id/CA%2BrenyVOjb4xQZGjdCnA54-1nzVSY%2B47-h4nkM-EP5J%3D1z%3Db9w%40mail.gmail.com Like then, I handled it by overloading the function with separate signatures for each built-in range type: int4range[] range_agg(int4range, bool, bool); int8range[] range_agg(int8range, bool, bool); numrange[] range_agg(numrange, bool, bool); tsrange[] range_agg(tsrange, bool, bool); tstzrange[] range_agg(tstzrange, bool, bool); daterange[] range_agg(daterange, bool, bool); (And users can still define a range_agg for their own custom range types using similar instructions to those in my extension's README.) The problem was the opr_sanity regression test, which rejects functions that share the same C-function implementation (roughly). I took a few stabs at changing my code to pass that test, e.g. defining separate wrapper functions for everything like this: Datum int4range_agg_transfn(PG_FUNCTION_ARGS) { return range_agg_transfn(fcinfo); } But that felt like it was getting ridiculous (8 range types * transfn+finalfn * 1-bool and 2-bool variants), so in the end I just added my functions to the "permitted" output in opr_sanity. Let me know if I should avoid that though. Also I would still appreciate some bikeshedding over the functions' UI since I don't feel great about it. If the overall approach seems okay, I'm still open to adding David's suggestions for weighted_range_agg and covering_range_agg. Thanks! Paul range_agg_v1.patch Description: Binary data
Re: Statistical aggregate functions are not working with PARTIAL aggregation
At Thu, 09 May 2019 11:17:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190509.111746.217492977.horiguchi.kyot...@lab.ntt.co.jp> > Valgrind doesn't detect the overruning read since the block > doesn't has 'MEMNOACCESS' region, since the requested size is > just 64 bytes. > > Thus the attached patch let valgrind detect the overrun. So the attached patch makes palloc always attach the MEMNOACCESS region and sentinel byte. The issue under discussion is detected with this patch either. (But in return memory usage gets larger.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 5c25b69822fdd86d05871f61cf30e47f514853ae Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 9 May 2019 13:18:33 +0900 Subject: [PATCH] Always put sentinel in allocated memory blocks. Currently allocated memory blocks doesn't have sentinel byte and valgrind NOACCESS region if requested size is equal to chunk size. The behavior largily diminishes the chance of overrun dection. This patch let allocated memory blocks are always armed with the stuff when MEMORY_CONTEXT_CHECKING is defined. --- src/backend/utils/mmgr/aset.c | 51 + src/backend/utils/mmgr/generation.c | 35 +++-- src/backend/utils/mmgr/slab.c | 23 ++--- 3 files changed, 64 insertions(+), 45 deletions(-) diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index 08aff333a4..15ef5c5afa 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -297,6 +297,13 @@ static const MemoryContextMethods AllocSetMethods = { #endif }; +/* to keep room for sentinel in allocated chunks */ +#ifdef MEMORY_CONTEXT_CHECKING +#define REQUIRED_SIZE(x) ((x) + 1) +#else +#define REQUIRED_SIZE(x) (x) +#endif + /* * Table for AllocSetFreeIndex */ @@ -726,9 +733,9 @@ AllocSetAlloc(MemoryContext context, Size size) * If requested size exceeds maximum for chunks, allocate an entire block * for this request. */ - if (size > set->allocChunkLimit) + if (REQUIRED_SIZE(size) > set->allocChunkLimit) { - chunk_size = MAXALIGN(size); + chunk_size = MAXALIGN(REQUIRED_SIZE(size)); blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; block = (AllocBlock) malloc(blksize); if (block == NULL) @@ -742,8 +749,8 @@ AllocSetAlloc(MemoryContext context, Size size) #ifdef MEMORY_CONTEXT_CHECKING chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ - if (size < chunk_size) - set_sentinel(AllocChunkGetPointer(chunk), size); + Assert (size < chunk_size); + set_sentinel(AllocChunkGetPointer(chunk), size); #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -787,7 +794,7 @@ AllocSetAlloc(MemoryContext context, Size size) * If one is found, remove it from the free list, make it again a member * of the alloc set and return its data address. */ - fidx = AllocSetFreeIndex(size); + fidx = AllocSetFreeIndex(REQUIRED_SIZE(size)); chunk = set->freelist[fidx]; if (chunk != NULL) { @@ -800,8 +807,8 @@ AllocSetAlloc(MemoryContext context, Size size) #ifdef MEMORY_CONTEXT_CHECKING chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ - if (size < chunk->size) - set_sentinel(AllocChunkGetPointer(chunk), size); + Assert (size < chunk->size); + set_sentinel(AllocChunkGetPointer(chunk), size); #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -824,7 +831,7 @@ AllocSetAlloc(MemoryContext context, Size size) * Choose the actual chunk size to allocate. 
*/ chunk_size = (1 << ALLOC_MINBITS) << fidx; - Assert(chunk_size >= size); + Assert(chunk_size >= REQUIRED_SIZE(size)); /* * If there is enough room in the active allocation block, we will put the @@ -959,8 +966,8 @@ AllocSetAlloc(MemoryContext context, Size size) #ifdef MEMORY_CONTEXT_CHECKING chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ - if (size < chunk->size) - set_sentinel(AllocChunkGetPointer(chunk), size); + Assert (size < chunk->size); + set_sentinel(AllocChunkGetPointer(chunk), size); #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -996,10 +1003,10 @@ AllocSetFree(MemoryContext context, void *pointer) #ifdef MEMORY_CONTEXT_CHECKING /* Test for someone scribbling on unused space in chunk */ - if (chunk->requested_size < chunk->size) - if (!sentinel_ok(pointer, chunk->requested_size)) - elog(WARNING, "detected write past chunk end in %s %p", - set->header.name, chunk); + Assert (chunk->requested_size < chunk->size); + if (!sentinel_ok(pointer, chunk->requested_size)) + elog(WARNING, "detected write past chunk end in %s %p", + set->header.name, chunk); #endif if (chunk->size > set->allocChunkLimit) @@ -1078,10 +1085,10 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size) #ifdef
Re: New vacuum option to do only freezing
On Thu, May 2, 2019 at 1:39 AM Robert Haas wrote: > > On Wed, May 1, 2019 at 12:14 PM Andres Freund wrote: > > On 2019-04-06 16:13:53 +0900, Michael Paquier wrote: > > > On Sat, Apr 06, 2019 at 11:31:31AM +0900, Masahiko Sawada wrote: > > > > Yes, but Fujii-san pointed out that this option doesn't support toast > > > > tables and I think there is not specific reason why not supporting > > > > them. So it might be good to add toast.vacuum_index_cleanup. Attached > > > > patch. > > > > > > Being able to control that option at toast level sounds sensible. I > > > have added an open item about that. > > > > Robert, what is your stance on this open item? It's been an open item > > for about three weeks, without any progress. > > The actual bug in this patch needs to be fixed, but I see we have > another open item for that. This open item, as I understand it, is > all about whether we should add another reloption so that you can > control this behavior separately for TOAST tables. In my opinion, > that's not a critical change and the open item should be dropped, but > others might see it differently. I agree that this item is neither critical nor a bug. But it is an oversight on my part, the patch is small, and I see no specific reason not to include it in 12. So if this patch can get enough review I think we can still have it in 12. Since the previous patch conflicts with current HEAD I've attached a rebased version. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center toast_vacuum_index_cleanup_v2.patch Description: Binary data
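For reference, this is what the existing table-level reloption looks like, and what the proposed TOAST-level variant would look like if the attached patch goes in ("mytab" is a placeholder table name):

    ALTER TABLE mytab SET (vacuum_index_cleanup = false);
    -- only with the attached patch applied:
    ALTER TABLE mytab SET (toast.vacuum_index_cleanup = false);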
Re: Fwd: Add tablespace tap test to pg_rewind
On Fri, Mar 22, 2019 at 04:25:53PM +0800, Shaoqi Bai wrote: > Deleted the test for group permissions in updated patch. Well, there are a couple of things I am not really happy about in this patch: - There is not much point to have has_tablespace_mapping as it is not extensible. Instead I'd rather use a single "extra" parameter which can define a range of options. An example of that is within PostgresNode::init for "extra" and "auth_extra". - CREATE TABLESPACE is run once on the primary *before* promoting the standby, which causes the tablespace paths to map between both of them. This is not correct. Creating a tablespace on the primary before creating the standby, and use --tablespace-map would be the way to go... However per the next point... - standby_afterpromotion is created on the promoted standby, and then immediately dropped. pg_rewind is able to handle this case when working on different hosts. But with this test we finish by having the same problem as pg_basebackup: the source and the target server finish by eating each other. I think that this could actually be an interesting feature for pg_rewind. - A comment at the end refers to databases, and not tablespaces. You could work out the first problem with the backup by changing the backup()/init_from_backup() in RewindTest::create_standby by a pure call to pg_basebackup, but you still have the second problem, which we should still be able to test, and this requires more facility in pg_rewind so as it is basically possible to hijack create_target_symlink() to create a symlink to a different path than the initial one. > Checking the RecursiveCopy::copypath being called, only _backup_fs and > init_from_backup called it. > After runing cmd make -C src/bin check in updated patch, seeing no failure. Yes, I can see that. The issue is that even if we do a backup with --tablespace-mapping then we still need a tweak to allow the copy of symlinks. I am not sure that this is completely what we are looking for either, as it means that any test setting a primary with a tablespace and two standbys initialized from the same base backup would fail. That's not really portable. -- Michael signature.asc Description: PGP signature
Re: any suggestions to detect memory corruption
Thank you Tom and Robert! I tried valgrind, and it looks like it helped me fix the issue. Someone added some code during backend init which used palloc, but at that time the CurrentMemoryContext is PostmasterContext. At the end of backend initialization the PostmasterContext is deleted, and then the error happens. The reason it happens randomly is that before the palloc there are some other if clauses which may skip the palloc. I still can't explain why PostmasterContext could sometimes have an impact on the "index info" MemoryContext, but now I just can't reproduce it (before the fix, it happened in about 30% of cases). On Thu, May 9, 2019 at 1:21 AM Robert Haas wrote: > On Wed, May 8, 2019 at 10:34 AM Tom Lane wrote: > > Alex writes: > > > I can get the following log randomly and I am not which commit caused > it. > > > > > 2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set > index > > > info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18 > > > > I've had success in finding memory stomp causes fairly quickly by setting > > a hardware watchpoint in gdb on the affected location. Then you just let > > it run to see when the value changes, and check whether that's a "legit" > > or "not legit" modification point. > > > > The hard part of that, of course, is to know in advance where the > affected > > location is. You may be able to make things sufficiently repeatable by > > doing the problem query in a fresh session each time. > > valgrind might also be a possibility, although that has a lot of overhead. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: Add tablespace tap test to pg_rewind
On Tue, May 07, 2019 at 10:31:59AM +0900, Kyotaro HORIGUCHI wrote: > The patch seems to be using the tablespace directory in backups > directly from standbys. In other words, multiple standbys created > from A backup shares the tablespace directory in the backup. Yes, I noticed that, and I am not happy about that either. I'd like to think that what we are looking for is an equivalent of the tablespace mapping of pg_basebackup, but for init_from_backup(). At least that sounds like a plausible solution. -- Michael signature.asc Description: PGP signature
Re: Unexpected "shared memory block is still in use"
On Wed, May 08, 2019 at 02:32:46PM -0400, Tom Lane wrote: > Just now, while running a parallel check-world on HEAD according to the > same script I've been using for quite some time, one of the TAP tests > died during initdb: > > selecting dynamic shared memory implementation ... posix > selecting default max_connections ... 100 > selecting default shared_buffers ... 128MB > selecting default timezone ... America/New_York > creating configuration files ... ok > running bootstrap script ... ok > performing post-bootstrap initialization ... 2019-05-08 13:59:19.963 EDT > [18351] FATAL: pre-existing shared memory block (key 5440004, ID 1734475802) > is still in use > 2019-05-08 13:59:19.963 EDT [18351] HINT: Terminate any old server processes > associated with data directory > "/home/postgres/pgsql/src/test/subscription/tmp_check/t_004_sync_publisher_data/pgdata". > child process exited with exit code 1 > initdb: removing data directory > "/home/postgres/pgsql/src/test/subscription/tmp_check/t_004_sync_publisher_data/pgdata" > Bail out! system initdb failed > > I have never seen this happen before in the TAP tests. > > I think the odds are very high that this implies something wrong with > commit c09850992. The odds are very high that you would not have gotten that error before that commit. But if the cause matches your guess, it's not something wrong with the commit ... > My immediate guess after eyeballing that patch quickly is that it was > not a good idea to redefine the rules used by bootstrap/standalone > backends. In particular, it seems somewhat plausible that the bootstrap > process hadn't yet completely died when the standalone backend for the > post-bootstrap phase came along and decided there was a conflict (which > it never would have before). If so, I would sure try to fix the initdb sequence to not let that happen. I would not trust such a conflict to be harmless. What OS, OS version, and filesystem?
Re: relcache reference leak with pglogical replication to insert-only partitioned table?
On Sunday, 27 January 2019 11:20:03 UTC+8, Jeremy Finzel wrote: > > I understand it's not fully supported to replicate to a differently > partitioned setup on a subscriber with either pglogical or the native > logical replication, however I also know that INSERT triggers can be fired > in replication mode. I have an insert-only OLTP table that I want > partitioned only on the subscriber system. I have this setup using the > "old style" partitioning as it is a 9.6 system. > > Provider is 9.6.6 pglogical 2.1.1 > Subscriber is 9.6.10 pglogical 2.1.1 > > Everything appears good as far as the data. It is partitioning > correctly. Queries on the data are planning correctly. However, I am now > getting these WARNING messages constantly. How concerned should I be? Is > there a fix for this? Any insight is much appreciated! > > 2019-01-27 03:12:34.150 GMT,,,135600,,5c4d1f44.211b0,6794,,2019-01-27 > 03:02:28 GMT,54/0,1057372660,WARNING,01000,"relcache reference leak: > relation ""foo_pkey"" not closed","apply COMMIT in commit before > 14DB/34DB1B78, xid 1476598649 commited at 2019-01-26 21:12:34.071673-06 > (action #10) from node replorigin 22""pglogical apply 16420:2094659706" > It won't corrupt anything but you can expect resource leaks. I think there are changes to support this in pglogical3. I don't have any new information on when they might become public, but I do know work has been done to add a plugin mechanism etc as part of work toward opening pglogical3.
Re: Wrong return code in vacuumdb when multiple jobs are used
On Thu, May 9, 2019 at 3:32 AM Michael Paquier wrote: > > On Sat, May 04, 2019 at 11:48:53AM -0400, Tom Lane wrote: > > +1, waiting till after the minor releases are tagged seems wisest. > > We can still push it before 12beta1, so it will get tested in the beta > > period. > > The new minor releases have been tagged, so committed. Thanks a lot!
Re: Unexpected "shared memory block is still in use"
At Wed, 8 May 2019 22:54:14 -0700, Noah Misch wrote in <20190509055414.gb1066...@rfd.leadboat.com> > On Wed, May 08, 2019 at 02:32:46PM -0400, Tom Lane wrote: > > Just now, while running a parallel check-world on HEAD according to the > > same script I've been using for quite some time, one of the TAP tests > > died during initdb: > > > > selecting dynamic shared memory implementation ... posix > > selecting default max_connections ... 100 > > selecting default shared_buffers ... 128MB > > selecting default timezone ... America/New_York > > creating configuration files ... ok > > running bootstrap script ... ok > > performing post-bootstrap initialization ... 2019-05-08 13:59:19.963 EDT > > [18351] FATAL: pre-existing shared memory block (key 5440004, ID > > 1734475802) is still in use > > 2019-05-08 13:59:19.963 EDT [18351] HINT: Terminate any old server > > processes associated with data directory > > "/home/postgres/pgsql/src/test/subscription/tmp_check/t_004_sync_publisher_data/pgdata". > > child process exited with exit code 1 > > initdb: removing data directory > > "/home/postgres/pgsql/src/test/subscription/tmp_check/t_004_sync_publisher_data/pgdata" > > Bail out! system initdb failed > > > > I have never seen this happen before in the TAP tests. > > > > I think the odds are very high that this implies something wrong with > > commit c09850992. > > The odds are very high that you would not have gotten that error before that > commit. But if the cause matches your guess, it's not something wrong with > the commit ... > > > My immediate guess after eyeballing that patch quickly is that it was > > not a good idea to redefine the rules used by bootstrap/standalone > > backends. In particular, it seems somewhat plausible that the bootstrap > > process hadn't yet completely died when the standalone backend for the > > post-bootstrap phase came along and decided there was a conflict (which > > it never would have before). > > If so, I would sure try to fix the initdb sequence to not let that happen. I > would not trust such a conflict to be harmless. > > What OS, OS version, and filesystem? PGSharedMemoryCreate shows the error in SHMSTATE_ANALYSYS_FAILURE case. PGSharedMemoryAttach returns the code when, for example, shmat failed with ENOMEM. I'm afraid that the message is not shown from SHMSTATE_ATTACHED.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: POC: Cleaning up orphaned files using undo logs
On Mon, May 6, 2019 at 5:43 PM Dilip Kumar wrote: > > Just for tracking, open comments which still needs to be worked on. > > 1. Avoid special case in UndoRecordIsValid. > > Can we instead eliminate the special case? It seems like the if > > (log->oldest_data == InvalidUndoRecPtr) case will be taken very > > rarely, so if it's buggy, we might not notice. I have worked on this comments and added changes in the latest patch. > > 2. While updating the previous transaction header instead of unpacking > complete header and writing it back, we can just unpack main header > and calculate the offset of uur_next and then update it directly. For this as you suggested I am not changing, updated the comments. > > 3. unifying uur_xid and uur_xidepoch into uur_fxid. Still open. I have also added the README. Patches can be applied on top of undo branch [1] commit: (cb777466d008e656f03771cf16ec7ef9d6f2778b) [1] https://github.com/EnterpriseDB/zheap/tree/undo -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-Update-oldest_data-on-startup_v5.patch Description: Binary data 0003-Test-module-for-undo-api_v5.patch Description: Binary data 0004-undo-page-consistency-checker_v5.patch Description: Binary data 0002-Provide-interfaces-to-store-and-fetch-undo-records_v5.patch Description: Binary data
Re: integrate Postgres Users Authentication with our own LDAP Server
On Thu, 2019-05-09 at 04:51 +, M Tarkeshwar Rao wrote: > We would need to integrate Postgres Users Authentication with our own LDAP > Server. > > Basically as of now we are able to login to Postgress DB with a user/password > credential. > > [roles "pg_signal_backend" and "postgres"] > > These user objects are the part of Postgres DB server. Now we want that these > users should be authenticated by LDAP server. > We would want the authentication to be done with LDAP, so basically the user > credentials should be store in LDAP server > > Can you mention the prescribed steps in Postgres needed for this integration > with LDAP Server? LDAP authentication is well documented: https://www.postgresql.org/docs/current/auth-ldap.html But I don't think you are on the right track. "pg_signal_backend" cannot login, it is a role to which you add a login user to give it certain privileges. So you don't need to authenticate the role. "postgres" is the installation superuser. If security is important for you, you won't set a password for that user and you won't allow remote logins with that user. But for your application users LDAP authentication is a fine thing, and not hard to set up if you know a little bit about LDAP. Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com
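As a minimal illustration of the role side of this (the pg_hba.conf side is covered by the documentation linked above; "app_user" is a placeholder name), with LDAP authentication the role still has to exist in PostgreSQL, but no password needs to be stored there:

    CREATE ROLE app_user LOGIN;
    -- the password is then verified by the LDAP server according to the
    -- ldap entry configured in pg_hba.conf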
Re: pgsql: Add strict_multi_assignment and too_many_rows plpgsql checks
On 2018-07-25 01:47, Tomas Vondra wrote: > Add strict_multi_assignment and too_many_rows plpgsql checks Could you clarify this error message: /* translator: %s represents a name of an extra check */ errdetail("%s check of %s is active.", "strict_multi_assignment", strict_multiassignment_level == ERROR ? "extra_errors" : "extra_warnings"), What does, for example, strict_multi_assignment check of extra_errors is active. mean? Also note that the translator comment is only mentioning one of the %s. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
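For anyone trying to see the message in context, a minimal sketch that should trigger it, assuming the checks behave as the commit describes (variable names are placeholders):

    SET plpgsql.extra_errors = 'strict_multi_assignment';

    DO $$
    DECLARE
      a int;
      b int;
    BEGIN
      -- two target variables but only one source column: with the setting above
      -- this raises an error whose DETAIL is the line quoted in the question,
      -- naming extra_errors as the source of the check
      SELECT 1 INTO a, b;
    END
    $$;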