Documentation and code don't agree about partitioned table UPDATEs

2019-02-05 Thread David Rowley
The docs in PG11 and master both state:

When an UPDATE causes a row to move from one partition to another,
there is a chance that another concurrent UPDATE or DELETE misses this
row. Suppose session 1 is performing an UPDATE on a partition key, and
meanwhile a concurrent session 2 for which this row is visible
performs an UPDATE or DELETE operation on this row. Session 2 can
silently miss the row if the row is deleted from the partition due to
session 1's activity. In such case, session 2's UPDATE or DELETE,
being unaware of the row movement thinks that the row has just been
deleted and concludes that there is nothing to be done for this row.
In the usual case where the table is not partitioned, or where there
is no row movement, session 2 would have identified the newly updated
row and carried out the UPDATE/DELETE on this new row version.


Which was true when it was added by Robert in 2f178441044. However,
f16241bef7c then added code to cause serialization failures when the
update/delete process encountered a moved row.  This seems to work,
going by:

CREATE TABLE listp (a INT, b INT) PARTITION BY LIST (a);
CREATE TABLE listp1 PARTITION OF listp FOR VALUES IN(1);
CREATE TABLE listp2 PARTITION OF listp FOR VALUES IN(2);

INSERT INTO listp VALUES (1, 0);

-- Session 1
BEGIN; SELECT * FROM listp WHERE a=1 FOR UPDATE;


-- Session 2
BEGIN; SELECT * FROM listp WHERE b = 0 FOR UPDATE;

-- Session 1
UPDATE listp SET a = 2 WHERE a = 1; COMMIT;

-- Session 2
ERROR:  tuple to be locked was already moved to another partition due
to concurrent update

So it appears that the documentation needs to be updated.

I've attached a patch which is my attempt at fixing this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


partition_update_doc_fix.patch
Description: Binary data


Re: Performance issue in foreign-key-aware join estimation

2019-02-05 Thread David Rowley
On Sat, 22 Dec 2018 at 09:28, David Rowley  wrote:
>
> On Fri, 21 Dec 2018 at 06:44, Tom Lane  wrote:
> > I was distressed to discover via perf that 69% of the runtime of this
> > test now goes into match_eclasses_to_foreign_key_col().  That seems
> > clearly unacceptable.
>
> Agreed. That's pretty terrible.
>
> I looked at this a bit and see that
> match_eclasses_to_foreign_key_col() is missing any smarts to skip
> equivalence classes that don't have ec_relids bits for both rels. With
> that added the run-time is reduced pretty dramatically.
>
> I've only tested with a debug build as of now, but I get:
>
> Unpatched:
>
> $ pgbench -n -T 60 -f query.sql postgres
> latency average = 18411.604 ms
>
> Patched:
> latency average = 8748.177 ms

So that this does not get lost, I've added an entry for the original
patch for the March commitfest.

While the patch does not bring the performance back to what it was
before this code was added, it makes a massive dent in the additional
overhead.

https://commitfest.postgresql.org/22/1984/
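
For anyone wanting to reproduce the planning overhead locally, here is a rough
sketch of the kind of schema and query shape that stresses this code path (my
own construction for illustration, not the query.sql used for the numbers
above): a chain of tables each referencing the previous one by foreign key,
joined on those same columns.

DO $$
DECLARE
    q text := 'SELECT count(*) FROM fk_t1';
BEGIN
    -- Build a chain of tables, each with a foreign key into the previous one.
    FOR i IN 1..100 LOOP
        EXECUTE format(
            'CREATE TABLE fk_t%s (id int PRIMARY KEY, ref int REFERENCES fk_t%s(id))',
            i, greatest(i - 1, 1));
    END LOOP;
    -- Join the whole chain on the foreign-key columns; planning this query
    -- is what exercises the equivalence-class / foreign-key matching.
    FOR i IN 2..100 LOOP
        q := q || format(' JOIN fk_t%s ON fk_t%s.ref = fk_t%s.id', i, i, i - 1);
    END LOOP;
    EXECUTE 'EXPLAIN ' || q;
END $$;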

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance issue in foreign-key-aware join estimation

2019-02-05 Thread David Rowley
On Tue, 5 Feb 2019 at 22:43, David Rowley  wrote:
> So that this does not get lost, I've added an entry for the original
> patch for the March commitfest.

Attaching the original patch again so the commitfest bot gets off my
back about the other one not applying.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


speed_up_matching_fkeys_to_eclasses.patch
Description: Binary data


Re: WIP: Avoid creation of the free space map for small tables

2019-02-05 Thread John Naylor
On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila  wrote:
>
> On Mon, Feb 4, 2019 at 2:27 PM John Naylor  
> wrote:
> >
> > 1. Earlier, I had a test to ensure that free space towards the front
> > of the relation was visible with no FSM. In [1], I rewrote it without
> > using vacuum, so we can consider adding it back now if desired. I can
> > prepare a patch for this.
> >
>
> Yes, this is required.  It is generally good practice to add a test (unless 
> it takes a lot of time) that covers new code/functionality.
>
> > 2. As a follow-on, since we don't rely on vacuum to remove dead rows,
> > we could try putting the fsm.sql test in some existing group in the
> > parallel schedule, rather than its own group as it is now.
> >
>
> +1.

This is done in 0001.

> > 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this
> > patch's effects on GetRecordedFreeSpace(), which will return zero for
> > tables with no FSM.
> >
>
> Right, but what exactly do we want to do about it?  Do you want to add a comment 
> atop this function?

Hmm, the comment already says "according to the FSM", so maybe it's
already obvious. I was thinking more about maybe commenting the
callsite where it's helpful, as in 0002.

> > The other callers are in:
> > contrib/pg_freespacemap/pg_freespacemap.c
> > contrib/pgstattuple/pgstatapprox.c
> >
> > For pg_freespacemap, this doesn't matter, since it's just reporting
> > the facts. For pgstattuple_approx(), it might under-estimate the free
> > space and over-estimate the number of live tuples.
> >
>
> Sure, but it can do so without the patch as well, if vacuum hasn't updated the 
> free space map.

Okay, then maybe we don't need to do anything else here.
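
For what it's worth, with this patch series applied the effect is easy to see
from SQL (assuming the pg_freespacemap extension is installed):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
CREATE TABLE fsm_tiny (i int);
INSERT INTO fsm_tiny VALUES (1);
VACUUM fsm_tiny;
-- With no FSM fork, GetRecordedFreeSpace() reports 0 for every block,
-- even though the single page is mostly empty.
SELECT * FROM pg_freespace('fsm_tiny');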

-- 
John Naylor    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From ff821d820630313e4dcea65aa1b35d47a0736c64 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 5 Feb 2019 10:32:19 +0100
Subject: [PATCH v1 1/2] Improve FSM regression test

In the interest of caution, in commit b0eaa4c51bb we left out a
test that used vacuum to remove dead rows. This test has been
rewritten to use fillfactor instead to control free space. Since
we no longer need to remove dead rows as part of the test, put
the fsm regression test in a parallel group.
---
 src/test/regress/expected/fsm.out  | 25 ++---
 src/test/regress/parallel_schedule |  8 +---
 src/test/regress/sql/fsm.sql   | 22 --
 3 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/src/test/regress/expected/fsm.out b/src/test/regress/expected/fsm.out
index b02993188c..d747c1a7dd 100644
--- a/src/test/regress/expected/fsm.out
+++ b/src/test/regress/expected/fsm.out
@@ -2,14 +2,33 @@
 -- Free Space Map test
 --
 CREATE TABLE fsm_check_size (num int, str text);
--- With one block, there should be no FSM
-INSERT INTO fsm_check_size VALUES(1, 'a');
+-- Fill 3 blocks with one record each
+ALTER TABLE fsm_check_size SET (fillfactor=15);
+INSERT INTO fsm_check_size SELECT i, rpad('', 1024, 'a')
+FROM generate_series(1,3) i;
+-- There should be no FSM
 VACUUM fsm_check_size;
 SELECT pg_relation_size('fsm_check_size', 'main') AS heap_size,
 pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
  heap_size | fsm_size 
 ---+--
-  8192 |0
+ 24576 |0
+(1 row)
+
+-- Fill most of the last block
+ALTER TABLE fsm_check_size SET (fillfactor=100);
+INSERT INTO fsm_check_size SELECT i, rpad('', 1024, 'a')
+FROM generate_series(101,105) i;
+-- Make sure records can go into any block but the last one
+ALTER TABLE fsm_check_size SET (fillfactor=30);
+-- Insert large record and make sure it does not cause the relation to extend
+INSERT INTO fsm_check_size VALUES (111, rpad('', 1024, 'a'));
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') AS heap_size,
+pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
+ heap_size | fsm_size 
+---+--
+ 24576 |0
 (1 row)
 
 -- Extend table with enough blocks to exceed the FSM threshold
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 4051a4ad4e..a956775dd1 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -23,7 +23,7 @@ test: numerology
 # --
 # The second group of parallel tests
 # --
-test: point lseg line box path polygon circle date time timetz timestamp timestamptz interval inet macaddr macaddr8 tstypes
+test: point lseg line box path polygon circle date time timetz timestamp timestamptz interval inet macaddr macaddr8 tstypes fsm
 
 # --
 # Another group of parallel tests
@@ -68,12 +68,6 @@ test: create_aggregate create_function_3 create_cast constraints triggers inheri
 # --
 test: sanity_check
 
-# --
-# fsm does a delete followed by vacuum, and running it in parallel can prevent
-# removal of rows.
-# --
-test: fsm
-
 # --
 # Bel

Re: Tighten up a few overly lax regexes in pg_dump's tap tests

2019-02-05 Thread Daniel Gustafsson
> On 5 Feb 2019, at 06:55, David G. Johnston  wrote:
> 
> On Monday, February 4, 2019, David Rowley  > wrote:
> On Tue, 5 Feb 2019 at 01:12, Daniel Gustafsson  > wrote:
> > We may also want to use the + metacharacter instead of * in a few places, 
> > since
> > the intent is to always match something, where matching nothing should be
> > considered an error:
> >
> > - qr/^ALTER TEXT SEARCH DICTIONARY dump_test.alt_ts_dict1 OWNER TO .*;/m,
> > + qr/^ALTER TEXT SEARCH DICTIONARY dump_test\.alt_ts_dict1 OWNER TO .*;/m,
> 
> I looked for instances of * alone and didn't see any. I only saw ones
> prefixed with ".", in which case, isn't that matching 1 or more chars
> already?
> 
> No.  In Regex the following are equivalent:
> 
> .* == .{0,}
> .+ == .{1,}
> . == .{1}
> 
> A “*” by itself would either be an error or, assuming the preceding character 
> is a space (so it visually looks alone), would match zero or more consecutive 
> spaces.

Sorry for being a bit unclear in my original email, it’s as David says above:
.* matches zero or more characters and .+ matches 1 or more characters.

> In the above “...OWNER TO;” is a valid match.

Indeed, so we should move to matching with .+ to force an owner.
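
For anyone who wants to convince themselves, the difference is easy to see with
the backend's POSIX regex operator (the .* / .+ semantics are the same as in
the Perl tests):

SELECT 'ALTER TEXT SEARCH DICTIONARY dump_test.alt_ts_dict1 OWNER TO ;'
           ~ 'OWNER TO .*;' AS star_matches,  -- t: an empty owner slips through
       'ALTER TEXT SEARCH DICTIONARY dump_test.alt_ts_dict1 OWNER TO ;'
           ~ 'OWNER TO .+;' AS plus_matches;  -- f: .+ insists on an owner name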

cheers ./daniel


Re: Online verification of checksums

2019-02-05 Thread Tomas Vondra



On 2/5/19 8:01 AM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
>>>>> I'm wondering (possibly again) about the existing early exit if one block
>>>>> cannot be read on retry: the command should count this as a kind of bad
>>>>> block, proceed on checking other files, and obviously fail in the end, but
>>>>> having checked everything else and generated a report. I do not think that
>>>>> this condition warrants a full stop. ISTM that under rare race conditions
>>>>> (eg, an unlucky concurrent "drop database" or "drop table") this could
>>>>> happen when online, although I could not trigger one despite heavy testing,
>>>>> so I'm possibly mistaken.
>>>>
>>>> This seems like a defensible judgement call either way.
>>>
>>> Right now we have a few tests that explicitly check that
>>> pg_verify_checksums fail on broken data ("foo" in the file).  Those
>>> would then just get skipped AFAICT, which I think is the worse behaviour
>>> , but if everybody thinks that should be the way to go, we can
>>> drop/adjust those tests and make pg_verify_checksums skip them.
>>>
>>> Thoughts?
>>
>> My point is that it should fail as it does, only not immediately (early
>> exit), but after having checked everything else. This means avoiding calling
>> "exit(1)" here and there (lseek, fopen...), but taking note that something
>> bad happened, and call exit only in the end.
> 
> I can see both as being valuable (one gives you a more complete picture,
> the other a quicker answer in scripts). For me that's the point where
> it's the prerogative of the author to make that choice.
> 

Why not make this configurable, using a command-line option?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: What happens if checkpoint haven't completed until the next checkpoint interval or max_wal_size?

2019-02-05 Thread Robert Haas
On Mon, Feb 4, 2019 at 2:50 PM Laurenz Albe 
wrote:

> Mohammad Sherafat wrote:
> > In the name of god!
>
> It is not considered good style to hurt people's religious feelings
> by using the name of god in vain.


I agree but...

> > What happens if checkpoint haven't completed until the next checkpoint
> > interval or max_wal_size?
>
> Then you have two checkpoints active at the same time.


...not about this part. I think the next checkpoint just doesn't start
until the one already in progress completes.

...Robert

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Online verification of checksums

2019-02-05 Thread Michael Banck
Hi,

On Tuesday, 2019-02-05 at 11:30 +0100, Tomas Vondra wrote:
> On 2/5/19 8:01 AM, Andres Freund wrote:
> > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > I'm wondering (possibly again) about the existing early exit if one 
> > > > > > block
> > > > > > cannot be read on retry: the command should count this as a kind of 
> > > > > > bad
> > > > > > block, proceed on checking other files, and obviously fail in the 
> > > > > > end, but
> > > > > > having checked everything else and generated a report. I do not 
> > > > > > think that
> > > > > > this condition warrants a full stop. ISTM that under rare race 
> > > > > > conditions
> > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this 
> > > > > > could
> > > > > > happen when online, although I could not trigger one despite heavy 
> > > > > > testing,
> > > > > > so I'm possibly mistaken.
> > > > > 
> > > > > This seems like a defensible judgement call either way.
> > > > 
> > > > Right now we have a few tests that explicitly check that
> > > > pg_verify_checksums fail on broken data ("foo" in the file).  Those
> > > > would then just get skipped AFAICT, which I think is the worse behaviour
> > > > , but if everybody thinks that should be the way to go, we can
> > > > drop/adjust those tests and make pg_verify_checksums skip them.
> > > > 
> > > > Thoughts?
> > > 
> > > My point is that it should fail as it does, only not immediately (early
> > > exit), but after having checked everything else. This means avoiding 
> > > calling
> > > "exit(1)" here and there (lseek, fopen...), but taking note that something
> > > bad happened, and call exit only in the end.
> > 
> > I can see both as being valuable (one gives you a more complete picture,
> > the other a quicker answer in scripts). For me that's the point where
> > it's the prerogative of the author to make that choice.

Personally, I would prefer to keep it as simple as possible for now and
get this patch committed; in my opinion the behaviour is already like
this (early exit on corrupt files) so I don't think the online
verification patch should change this.

If we see complaints about this, then I'd be happy to change it
afterwards.

> Why not make this configurable, using a command-line option?

I like this even less - this tool is about verifying checksums, so
adding options on what to do when it encounters broken pages looks out-
of-scope to me.  Unless we want to say it should generally abort on the
first issue (i.e. on wrong checksums as well).


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz



Re: Feature: temporary materialized views

2019-02-05 Thread Michael Paquier
Hi Andreas,

On Tue, Feb 05, 2019 at 12:59:12PM +0900, Michael Paquier wrote:
> Now...  You have on this thread all the audience which already worked
> on 874fe3a.  And I am just looking at this patch, evaluating the
> behavior change this is introducing.  Still I would recommend a
> separate thread as others may want to comment on that particular
> point.

So I have read through your patch, and there are a couple of things
which I think we could simplify more.  Here are my notes:
1) We could remove the into clause from DR_intorel, which is used for
two things:
- Determine the relkind of the relation created.  However the relation
gets created before entering the executor, and we already know its
OID, so we also know its relkind.
- skipData is visibly always false.
We may want to keep skipData to have an assertion at the beginning of
intorel_startup for sanity purposes though.
2) DefineIntoRelForDestReceiver is just a wrapper for
create_ctas_nodata, so we had better just merge both of them and
expose directly the routine creating the relation definition, as the
new interface is a bit awkward.
3) The part about the regression diff is well...  Expected...  We may
want a comment about that.  We could consider as well adding a
regression test inspired from REINDEX SCHEMA to show that the CTAS is
created before the data is actually filled in.
--
Michael


signature.asc
Description: PGP signature


Re: What happens if checkpoint haven't completed until the next checkpoint interval or max_wal_size?

2019-02-05 Thread Michael Paquier
On Tue, Feb 05, 2019 at 04:11:55PM +0530, Robert Haas wrote:
> ...not about this part. I think the next checkpoint just doesn't start
> until the one already in progress completes.

Yes, the requests are communicated from backends to the checkpointer
via shared memory (see ckpt_flags in RequestCheckpoint), and the
backend signals the checkpointer to do the work; still, it won't start
until the checkpoint currently running finishes.
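
The effect is also visible from SQL: explicitly requested checkpoints are
counted separately from timed ones, and a CHECKPOINT command issued while one
is already in progress simply waits for the next checkpoint to complete:

-- Timed vs. explicitly requested checkpoints:
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;
-- An explicit request; if a checkpoint is already running, this blocks until
-- a fresh checkpoint, started after the current one finishes, is done.
CHECKPOINT;
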
--
Michael


signature.asc
Description: PGP signature


Internal error while setting reloption on system catalogs.

2019-02-05 Thread Kyotaro HORIGUCHI
Hello.

The following command complains with an internal
error. (allow_system_table_mods is on).

alter table pg_attribute set (fillfactor = 90);
> ERROR:  AccessExclusiveLock required to add toast table.

The same happens for pg_class.  This is because ATRewriteCatalogs
tries to add a toast table to the relations while holding only
ShareUpdateExclusiveLock.  The decision about which system catalogs
get a toast table was made in commit 96cdeae07f, which deliberately
left these two relations without one, so we should skip the toast
check for all mapped relations.

I didn't find a place where ATTACH PARTITION sends
RELKIND_PARTITIONED_TABLE to the function, so that case is omitted in
the patch.  In fact, the regression tests don't complain with the
following assertion placed in the function.

> Assert(tab->relkind != RELKIND_PARTITIONED_TABLE ||
>tab->partition_constraint == NULL ||
>!tab->check_toast)

Many other commands seem not to need the toast check, but the patch
doesn't touch them; that would be a separate issue.

According to alter_table.sql, this doesn't need a test:

-- XXX: It would be useful to add checks around trying to manipulate
-- catalog tables, but that might have ugly consequences when run
-- against an existing server with allow_system_table_mods = on.

The first attached is for master and 11.
The second is for 10.
The third is for 9.6.

9.4 and 9.5 don't suffer from the problem, but only because they
still create toast tables for mapped relations.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 35a9ade059..026ee027cd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -159,6 +159,7 @@ typedef struct AlteredTableInfo
 	List	   *newvals;		/* List of NewColumnValue */
 	bool		new_notnull;	/* T if we added new NOT NULL constraints */
 	int			rewrite;		/* Reason for forced rewrite, if any */
+	bool		check_toast;	/* check for toast table */
 	Oid			newTableSpace;	/* new tablespace; 0 means no change */
 	bool		chgPersistence; /* T if SET LOGGED/UNLOGGED is used */
 	char		newrelpersistence;	/* if above is true */
@@ -4051,15 +4052,7 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode)
 	{
 		AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
 
-		/*
-		 * If the table is source table of ATTACH PARTITION command, we did
-		 * not modify anything about it that will change its toasting
-		 * requirement, so no need to check.
-		 */
-		if (((tab->relkind == RELKIND_RELATION ||
-			  tab->relkind == RELKIND_PARTITIONED_TABLE) &&
-			 tab->partition_constraint == NULL) ||
-			tab->relkind == RELKIND_MATVIEW)
+		if (tab->check_toast)
 			AlterTableCreateToastTable(tab->relid, (Datum) 0, lockmode);
 	}
 }
@@ -4917,6 +4910,15 @@ ATGetQueueEntry(List **wqueue, Relation rel)
 	tab->newrelpersistence = RELPERSISTENCE_PERMANENT;
 	tab->chgPersistence = false;
 
+	/* We don't check for toast table of mapped relations */
+	if ((tab->relkind == RELKIND_MATVIEW ||
+		 tab->relkind == RELKIND_RELATION ||
+		 tab->relkind == RELKIND_PARTITIONED_TABLE) &&
+		!RelationIsMapped(rel))
+		tab->check_toast = true;
+	else
+		tab->check_toast = false;
+
 	*wqueue = lappend(*wqueue, tab);
 
 	return tab;
@@ -14425,6 +14427,14 @@ QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
 		Assert(tab->partition_constraint == NULL);
 		tab->partition_constraint = (Expr *) linitial(partConstraint);
 		tab->validate_default = validate_default;
+
+		/*
+		 * If the table is source table of ATTACH PARTITION command, we did
+		 * not modify anything about it that will change its toasting
+		 * requirement, so no need to check.
+		 */
+		Assert(tab->partition_constraint != NULL);
+		tab->check_toast = false;
 	}
 	else if (scanrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6f1a429c7e..d238dd78da 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -164,6 +164,7 @@ typedef struct AlteredTableInfo
 	List	   *newvals;		/* List of NewColumnValue */
 	bool		new_notnull;	/* T if we added new NOT NULL constraints */
 	int			rewrite;		/* Reason for forced rewrite, if any */
+	bool		check_toast;	/* check for toast table */
 	Oid			newTableSpace;	/* new tablespace; 0 means no change */
 	bool		chgPersistence; /* T if SET LOGGED/UNLOGGED is used */
 	char		newrelpersistence;	/* if above is true */
@@ -3793,15 +3794,7 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode)
 	{
 		AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
 
-		/*
-		 * If the table is source table of ATTACH PARTITION command, we did
-		 * not modify anything about it that will change its toasting
-		 * requirement, so no need to check.
-		 */
-		if (((tab->relkind == RELKIND_RELATION ||
-			  tab->relkind == RELKIND_PARTITIONED_TABLE) &&
-			 tab->partition_constraint =

Re: What happens if checkpoint haven't completed until the next checkpoint interval or max_wal_size?

2019-02-05 Thread Kyotaro HORIGUCHI
At Tue, 5 Feb 2019 20:42:59 +0900, Michael Paquier  wrote 
in <20190205114259.gh1...@paquier.xyz>
> On Tue, Feb 05, 2019 at 04:11:55PM +0530, Robert Haas wrote:
> > ...not about this part. I think the next checkpoint just doesn't start
> > until the one already in progress completes.
> 
> Yes, the requests are communicated from any backends to the
> checkpointer with shared memory (see ckpt_flags in RequestCheckpoint),
> and the backend signals the checkpointer to do the work, still it
> won't do the work until the checkpoint currently running finishes.

Also, checkpoint requests are not queued: multiple checkpoint
requests (whether automatic or manual) made during a running
checkpoint are all satisfied at once by the checkpoint that runs
after the current one finishes.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Ltree syntax improvement

2019-02-05 Thread Nikolay Shaplov
In a message dated Tuesday, 29 January 2019 20:43:07 MSK, Dmitry 
Belyavsky wrote:

> Please find attached the patch extending the sets of symbols allowed in
> ltree labels. The patch introduces 2 variants of escaping symbols, via
> backslashing separate symbols and via quoting the labels in whole for
> ltree, lquery and ltxtquery datatypes.

Hi!

Let's me join the review process...

I've just given the patch a superficial look: read it through and tried things 
here and there, so first I'll give feedback about the surface aspects of the patch.

So, the first question: why do we need all this?  The answer seems to be 
obvious, but it would be good to say it explicitly.  What problems would it 
solve?  Do these problems really exist? 

The second question is about coding style.  As far as I can see you've run 
your code through pgindent, but that does not solve all the problems.

As far as I know, postgres committers are quite strict about code width: the 
code should not be wider than 80 columns, except in places where string constants 
are introduced (there you can legally exceed 80 columns).  So I would advise using 
vim with tabstop=4 and colorcolumn=80 and making sure that new code does not 
cross the red vertical line. 

Comments.  There is no fixed coding style for comments in postgres.  Usually this 
means that it is better to follow the style of the surrounding code.  But ltree does 
not have many comments, so I would suggest using the best style I've seen in 
the code:
/*
 * [function name]
 * < tab > [Short description, what does it do]
 *
 * [long explanation of how it works, several subparagraphs if needed]
 */
(Seen this in src/include/utils/rel.h)
or something like that.
At least it would be good to have an explicit separation between the description 
and the explanation of how it works.

And about this comment:
/*
 * Function calculating bytes to escape
 * to_escape is an array of "special" 1-byte symbols
 * Behaviour:
 * If there are no "special" symbols, return 0
 * If there is any special symbol, we need the initial and final quote, so return 2
 * If there are any quotes, we need to escape all of them and also the initial
 * and final quote, so return 2 + number of quotes
 */

It has some formatting, but it should not.  As far as I understand, this 
comment should be treated as a single paragraph by the reformatter and reflowed; 
I do not understand why pgindent has not done it.  The documentation (https://
www.postgresql.org/docs/11/source-format.html) says that if you want to avoid 
reformatting, you should add a line of dashes ("-----") at the beginning and at 
the end of the comment.  So it would be good to either remove this formatting, 
or add the dashes if you are sure you want to keep it.

Third part is about error messages.
First, I've seen errors like elog(ERROR, "internal error during splitting 
levels");.  I've never seen an error like that in postgres.  Maybe, if this 
error is in a part of the code that should never be reached in real life, it 
would be good to change it to an Assert?  Or, if this code can be reached 
normally, add some more explanation of what happened...
About other error messages.

There are messages like 
errmsg("syntax error"),
errdetail("Unexpected end of line.")));

or

errmsg("escaping syntax error")));


In postgres there is an error message style guide:
https://www.postgresql.org/docs/11/error-style-guide.html

Here I would expect messages more like:

Primary: Error parsing ltree path
Detail: Curly quote is opened and not closed

Or something like that; I am not an expert in English, but this way it would be 
better for sure.  The key points are: the primary message should point to the 
general area where the error occurred (there can be a large SQL statement with 
a lot of things in it, so it is good to point out that the problem is with the 
ltree stuff), and it is better to word it from the point of view of the user.  
What for you (and me) is an unexpected end of line is, for the user, an 
unclosed curly quote. 

Also, I am not sure whether it is easy, but if it is possible, it would be good 
to move the "^" symbol that points to the place of the error to the position 
inside the quoted value where the unclosed curly quote is located.  As a user I 
would appreciate that.

So that's what I've seen while giving it a superficial look...




Re: Memory contexts reset for trigger invocations

2019-02-05 Thread Tom Lane
Andres Freund  writes:
> Wouldn't it be better to reset an appropriate context after each
> invocation? Yes, that'd require some care to manage the lifetime of
> tuples returned by triggers, but that seems OK?

I'm not sure that we can change much here without effectively breaking
the trigger-function API.  I doubt that there's enough win to be had
to justify forcing external PLs etc to change.

Having said that, I recall that that API was kind of a pain in the
rear when I was redoing plpgsql's handling of composite variables
last year.  Don't recall details right now (ENOCAFFEINE).  Maybe
a wholesale rethink would be justified?  But I'm not excited about
just twiddling things at the margin.

regards, tom lane



Re: Memory contexts reset for trigger invocations

2019-02-05 Thread Andres Freund
Hi,

On February 5, 2019 7:55:28 PM GMT+05:30, Tom Lane  wrote:
>Andres Freund  writes:
>> Wouldn't it be better to reset an appropriate context after each
>> invocation? Yes, that'd require some care to manage the lifetime of
>> tuples returned by triggers, but that seems OK?
>
>I'm not sure that we can change much here without effectively breaking
>the trigger-function API.

HM, why? If we copy the tuple into a longer lived context we ought to be able 
to reset more aggressively.

>  I doubt that there's enough win to be had
>to justify forcing external PLs etc to change.
>
>Having said that, I recall that that API was kind of a pain in the
>rear when I was redoing plpgsql's handling of composite variables
>last year.  Don't recall details right now (ENOCAFFEINE).  Maybe
>a wholesale rethink would be justified?  But I'm not excited about
>just twiddling things at the margin.

Yea, we probably ought to: the pluggable storage patch set makes trigger.c use 
slots, but the trigger interface requires heap tuples, so we extract heap 
tuples.  But probably it'd be good to go fully over to slots at some point.

Regards,


Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Internal error while setting reloption on system catalogs.

2019-02-05 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> The following command complains with an internal
> error. (allow_system_table_mods is on).

> alter table pg_attribute set (fillfactor = 90);
>> ERROR:  AccessExclusiveLock required to add toast table.

> The same happens for pg_class.

Isn't this more or less the same thing Peter E. was attempting to fix in
https://commitfest.postgresql.org/22/1764/
?

> The first attached is for master and 11.
> The second is for 10.
> The third is for 9.6.

It's not clear to me that we ought to consider catalog alterations
supported at all, but in any case I wouldn't treat it as something
to back-patch.

regards, tom lane



Re: fast defaults in heap_getattr vs heap_deform_tuple

2019-02-05 Thread Andrew Dunstan


On 2/1/19 5:49 PM, Andres Freund wrote:
> Hi,
>
> On 2019-02-01 08:24:04 -0800, Andres Freund wrote:
>
> Andrew, I think it'd be good to do a ground up review of the fast
> defaults patch. There's been a fair number of issues in it, and this is
> a another pretty fundamental issue.
>


OK. Will try to organise it.


cheers


andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Fix optimization of foreign-key on update actions

2019-02-05 Thread Peter Eisentraut
I came across an edge case in how our foreign key implementation works
that I think is not SQL conforming.  It has to do with how updates to
values that "look" different but compare as equal are cascaded.  A
simple case involves float -0 vs. 0, but relevant cases also arise with
citext and case-insensitive collations.

Consider this example:

create table pktable2 (a float8, b float8, primary key (a, b));
create table fktable2 (x float8, y float8,
foreign key (x, y) references pktable2 (a, b) on update cascade);

insert into pktable2 values ('-0', '-0');
insert into fktable2 values ('-0', '-0');

update pktable2 set a = '0' where a = '-0';

What happens now?

select * from pktable2;
 a | b
---+
 0 | -0
(1 row)

-- buggy: did not update fktable2.x
select * from fktable2;
 x  | y
+
 -0 | -0
(1 row)

This happens because ri_KeysEqual() compares the old and new rows and
decides that because they are "equal", the ON UPDATE actions don't need
to be run.

The SQL standard seems clear that ON CASCADE UPDATE means that an
analogous UPDATE should be run on matching rows in the foreign key
table, so the current behavior is wrong.

Moreover, if another column is also updated, like update pktable2 set a
= '0', b = '5', then the old and new rows are no longer equal, and so x
will get updated in fktable2.  So the context creates inconsistencies.

The fix is that in these cases we have ri_KeysEqual() use a more
low-level form of equality, like for example record_image_eq does.  In
fact, we can take the same code.  See attached patches.
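
As a quick way to see the distinction at the SQL level, the row image
comparison operator (backed by record_image_eq) already exposes it:

SELECT ROW(float8 '-0') =  ROW(float8 '0') AS datatype_equal,  -- t
       ROW(float8 '-0') *= ROW(float8 '0') AS image_equal;     -- f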

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 8193fe1560c9e7aaa48c039fb3c85ab93558d596 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 5 Feb 2019 16:27:00 +0100
Subject: [PATCH 1/2] Test case for keys that "look" different but compare as
 equal

---
 src/test/regress/expected/foreign_key.out | 34 +++
 src/test/regress/sql/foreign_key.sql  | 20 +
 2 files changed, 54 insertions(+)

diff --git a/src/test/regress/expected/foreign_key.out 
b/src/test/regress/expected/foreign_key.out
index bf2c91d9f0..a68d055e40 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -1474,6 +1474,40 @@ delete from pktable2 where f1 = 1;
 alter table fktable2 drop constraint fktable2_f1_fkey;
 ERROR:  cannot ALTER TABLE "pktable2" because it has pending trigger events
 commit;
+drop table pktable2, fktable2;
+--
+-- Test keys that "look" different but compare as equal
+--
+create table pktable2 (a float8, b float8, primary key (a, b));
+create table fktable2 (x float8, y float8, foreign key (x, y) references 
pktable2 (a, b) on update cascade);
+insert into pktable2 values ('-0', '-0');
+insert into fktable2 values ('-0', '-0');
+select * from pktable2;
+ a  | b  
++
+ -0 | -0
+(1 row)
+
+select * from fktable2;
+ x  | y  
++
+ -0 | -0
+(1 row)
+
+update pktable2 set a = '0' where a = '-0';
+select * from pktable2;
+ a | b  
+---+
+ 0 | -0
+(1 row)
+
+-- buggy: did not update fktable2.x
+select * from fktable2;
+ x  | y  
++
+ -0 | -0
+(1 row)
+
 drop table pktable2, fktable2;
 --
 -- Foreign keys and partitioned tables
diff --git a/src/test/regress/sql/foreign_key.sql 
b/src/test/regress/sql/foreign_key.sql
index c8d1214d02..c1fc7d30b1 100644
--- a/src/test/regress/sql/foreign_key.sql
+++ b/src/test/regress/sql/foreign_key.sql
@@ -1106,6 +1106,26 @@ CREATE TEMP TABLE tasks (
 
 drop table pktable2, fktable2;
 
+--
+-- Test keys that "look" different but compare as equal
+--
+create table pktable2 (a float8, b float8, primary key (a, b));
+create table fktable2 (x float8, y float8, foreign key (x, y) references 
pktable2 (a, b) on update cascade);
+
+insert into pktable2 values ('-0', '-0');
+insert into fktable2 values ('-0', '-0');
+
+select * from pktable2;
+select * from fktable2;
+
+update pktable2 set a = '0' where a = '-0';
+
+select * from pktable2;
+-- buggy: did not update fktable2.x
+select * from fktable2;
+
+drop table pktable2, fktable2;
+
 
 --
 -- Foreign keys and partitioned tables

base-commit: 24114e8b4df4a4ff2db9e608742768d219b1067c
-- 
2.20.1

From 083363ef132829bbf6b990317253868b5e48a137 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 5 Feb 2019 16:31:33 +0100
Subject: [PATCH 2/2] Fix optimization of foreign-key on update actions

In RI_FKey_pk_upd_check_required(), we check among other things
whether the old and new key are equal, so that we don't need to run
cascade actions when nothing has actually changed.  This was using the
equality operator.  But the effect of this is that if a value in the
primary key is changed to one that "looks" different but compares as
equal, the update is not propagated.  (Examples are float -0 and 0 and
case-insensitive text.)  This appears to violate the SQL standard, and
it also behaves inconsistently if in a multic

Re: Fix optimization of foreign-key on update actions

2019-02-05 Thread Andrew Gierth
> "Peter" == Peter Eisentraut  writes:

 Peter> The SQL standard seems clear

ha

hahaha

HAHAHAHAHAHA

(since when has the SQL standard ever been clear?)

SQL2016, 15.17 Execution of referential actions

10) If a non-null value of a referenced column RC in the referenced
table is updated to a value that is distinct from the current value
of RC, then,

  [snip all the stuff about how ON UPDATE actions work]

does that "is distinct from" mean that IS DISTINCT FROM would be true,
or does it mean "is in some way distinguishable from"? Nothing I can see
in the spec suggests the latter.
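
For the float case at hand, the predicate reading would not require any
cascade at all, since the two zeroes compare equal:

SELECT float8 '-0' IS DISTINCT FROM float8 '0';  -- f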

-- 
Andrew (irc:RhodiumToad)



Re: Fix optimization of foreign-key on update actions

2019-02-05 Thread Tom Lane
Andrew Gierth  writes:
> "Peter" == Peter Eisentraut  writes:
>  Peter> The SQL standard seems clear

> (since when has the SQL standard ever been clear?)

Point to Andrew ;-).  However, I kind of like Peter's idea anyway
on the grounds that byte-wise comparison is probably faster than
invoking the datatype comparators.  Also, language lawyering aside,
I think he may be right about what people would expect "should" happen.

What I *don't* like about the proposed patch is that it installs a
new, different comparison rule for the ON UPDATE CASCADE case only.
If we were to go in this direction, I'd think we should try to use
the same comparison rule for all FK row comparisons.

The inconsistencies get messier the more you think about it,
really.  If a referencing row was datatype-equal, but not byte-equal,
to the PK row to start with, why would an update changing the PK row
(perhaps to another datatype-equal value) result in forcing the
referencing row to become byte-equal?  How does this fit in with
the fact that our notion of what uniqueness means in the PK table
is no-datatype-equality, rather than no-byte-equality?

On the whole we might be better off leaving this alone.

regards, tom lane



Re: pg11.1: dsa_area could not attach to segment

2019-02-05 Thread Justin Pryzby
I finally reproduced this with core..

For some reason I needed to write assert() rather than elog(PANIC), otherwise
it failed with ERROR and no core..

@@ -1741,4 +1743,5 @@ get_segment_by_index(dsa_area *area, dsa_segment_index index)
segment = dsm_attach(handle);
+   assert (segment != NULL);
if (segment == NULL)
-   elog(ERROR, "dsa_area could not attach to segment");
+   elog(PANIC, "dsa_area could not attach to segment");
if (area->mapping_pinned)

On Mon, Dec 03, 2018 at 11:45:00AM +1300, Thomas Munro wrote:
> If anyone can reproduce this problem with a debugger, it'd be
> interesting to see the output of dsa_dump(area), and
> FreePageManagerDump(segment_map->fpm).

Looks like this will take some work; is it OK if I make a core dump available to
you?  I'm not sure how sensitive it is to re/compilation, but I'm using PG11.1
compiled locally on centos6.

/var/log/postgresql/postgresql-2019-02-05_111730.log-< 2019-02-05 11:17:31.372 
EST  >LOG:  background worker "parallel worker" (PID 17110) was terminated by 
signal 6: Aborted
/var/log/postgresql/postgresql-2019-02-05_111730.log:< 2019-02-05 11:17:31.372 
EST  >DETAIL:  Failed process was running: SELECT colcld.child c, parent p, 
array_agg(colpar.attname::text ORDER BY colpar.attnum) cols, 
array_agg(format_type(colpar.atttypid, colpar.atttypmod) ORDER BY 
colpar.attnum) AS types FROM queued_alters qa JOIN pg_attribute colpar ON 
to_regclass(qa.parent)=colpar.attrelid AND colpar.attnum>0 AND NOT 
colpar.attisdropped JOIN (SELECT *, attrelid::regclass::text AS child FROM 
pg_attribute) colcld ON to_regclass(qa.child) =colcld.attrelid AND 
colcld.attnum>0 AND NOT colcld.attisdropped WHERE colcld.attname=colpar.attname 
AND colpar.atttypid!=colcld.atttypid GROUP BY 1,2 ORDER BY parent LIKE 
'unused%', regexp_replace(colcld.child, 
'.*_((([0-9]{4}_[0-9]{2})_[0-9]{2})|(([0-9]{6})([0-9]{2})?))$', '\3\5') DESC, 
regexp_replace(colcld.child, '.*_', '') DESC LIMIT 1

(gdb) bt
#0  0x0037b9c32495 in raise () from /lib64/libc.so.6
#1  0x0037b9c33c75 in abort () from /lib64/libc.so.6
#2  0x0037b9c2b60e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0037b9c2b6d0 in __assert_fail () from /lib64/libc.so.6
#4  0x008c4a72 in get_segment_by_index (area=0x2788440, index=) at dsa.c:1744
#5  0x008c58e9 in get_best_segment (area=0x2788440, npages=8) at 
dsa.c:1995
#6  0x008c6c99 in dsa_allocate_extended (area=0x2788440, size=32768, 
flags=0) at dsa.c:703
#7  0x0064c6fe in ExecParallelHashTupleAlloc (hashtable=0x27affb0, 
size=104, shared=0x7ffc6b5cfc48) at nodeHash.c:2837
#8  0x0064cb92 in ExecParallelHashTableInsert (hashtable=0x27affb0, 
slot=, hashvalue=423104953) at nodeHash.c:1693
#9  0x0064cf17 in MultiExecParallelHash (node=0x27a1ed8) at 
nodeHash.c:288
#10 MultiExecHash (node=0x27a1ed8) at nodeHash.c:112
#11 0x0064e1f8 in ExecHashJoinImpl (pstate=0x2793038) at 
nodeHashjoin.c:290
#12 ExecParallelHashJoin (pstate=0x2793038) at nodeHashjoin.c:581
#13 0x00638ce0 in ExecProcNodeInstr (node=0x2793038) at 
execProcnode.c:461
#14 0x006349c7 in ExecProcNode (queryDesc=0x2782cd0, direction=, count=0, execute_once=56) at 
../../../src/include/executor/executor.h:237
#15 ExecutePlan (queryDesc=0x2782cd0, direction=, count=0, 
execute_once=56) at execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x2782cd0, direction=, 
count=0, execute_once=56) at execMain.c:364
#17 0x7f84a97c8618 in pgss_ExecutorRun (queryDesc=0x2782cd0, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
pg_stat_statements.c:892
#18 0x7f84a93357dd in explain_ExecutorRun (queryDesc=0x2782cd0, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
auto_explain.c:268
#19 0x00635071 in ParallelQueryMain (seg=0x268fba8, toc=0x7f84a9578000) 
at execParallel.c:1402
#20 0x00508f34 in ParallelWorkerMain (main_arg=) 
at parallel.c:1409
#21 0x00704760 in StartBackgroundWorker () at bgworker.c:834
#22 0x0070e11c in do_start_bgworker () at postmaster.c:5698
#23 maybe_start_bgworkers () at postmaster.c:5911
#24 0x00710786 in sigusr1_handler (postgres_signal_arg=) at postmaster.c:5091
#25 
#26 0x0037b9ce1603 in __select_nocancel () from /lib64/libc.so.6
#27 0x0071300e in ServerLoop (argc=, argv=) at postmaster.c:1670
#28 PostmasterMain (argc=, argv=) at 
postmaster.c:1379
#29 0x0067e8c0 in main (argc=3, argv=0x265f960) at main.c:228

#0  0x0

Re: Commit Fest 2019-01 is now closed

2019-02-05 Thread Bruce Momjian
On Mon, Feb  4, 2019 at 06:34:50AM +, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:mich...@paquier.xyz]
> > As per $subject, CF 2019-01 is now closed for business.  Here is the final
> > score:
> > Committed: 58.
> > Moved to next CF: 113.
> > Withdrawn: 4.
> > Rejected: 3.
> > Returned with Feedback: 52.
> > Total: 230.
> 
> Wow, thank you so much for your hard work.  The last CF for PG 12 should be 
> tough...

Agreed.  I am somewhat concerned about this and am wondering what we can
do now to limit problems.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Commit Fest 2019-01 is now closed

2019-02-05 Thread Tom Lane
Bruce Momjian  writes:
> On Mon, Feb  4, 2019 at 06:34:50AM +, Tsunakawa, Takayuki wrote:
>> Wow, thank you so much for your hard work.  The last CF for PG 12 should be 
>> tough...

> Agreed.  I am somewhat concerned about this and am wondering what we can
> do now to limit problems.

There's been talk periodically of having an aggressive triage effort
to try to sort through the pending patches and decide which ones have
no hope of making it to commit in the last CF.  Then, if we just push
those off to the next cycle immediately and focus our attention on the
ones that do have a credible chance, we might get more of the latter
ones done.

regards, tom lane



Re: fast defaults in heap_getattr vs heap_deform_tuple

2019-02-05 Thread Andres Freund
Hi,

On 2019-02-05 10:14:48 -0500, Andrew Dunstan wrote:
> 
> On 2/1/19 5:49 PM, Andres Freund wrote:
> > Hi,
> >
> > On 2019-02-01 08:24:04 -0800, Andres Freund wrote:
> >
> > Andrew, I think it'd be good to do a ground up review of the fast
> > defaults patch. There's been a fair number of issues in it, and this is
> > a another pretty fundamental issue.
> >
> 
> 
> OK. Will try to organise it.

Cool. Here's the patch that I, after some commit message polishing, plan
to commit to 11 and master to fix the issue at hand.

Greetings,

Andres Freund
From 5fef71e1ffa0ba6a12ca9db6e286221750f53dc2 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 4 Feb 2019 07:10:12 -0800
Subject: [PATCH] Fix heap_getattr() handling of fast defaults.

Previously heap_getattr() returned NULL for attributes with a fast
default value (c.f. 16828d5c0273), as it had no handling whatsoever
for that case.

A previous fix, 7636e5c60f, attempted to fix issues caused by this
oversight, but just expanding OLD tuples for triggers doesn't actually
solve the actual issue.

One known consequence of this bug is that the check for HOT updates
can return the wrong result, when a previously fast-default'ed column
is set to NULL.

Fix by handling heap_getattr(), remove now superfluous expansion in
GetTupleForTrigger().

Author: Andres Freund
Discussion: https://postgr.es/m/20190201162404.onngi77f26bae...@alap3.anarazel.de
Backpatch: 11, where fast defaults were introduced
---
 src/backend/access/common/heaptuple.c  |  4 +---
 src/backend/commands/trigger.c |  5 +---
 src/include/access/htup_details.h  |  7 +++---
 src/test/regress/expected/fast_default.out | 27 ++
 src/test/regress/sql/fast_default.sql  | 20 
 5 files changed, 52 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index ed4549ca579..783b04a3cb9 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -71,8 +71,6 @@
 #define VARLENA_ATT_IS_PACKABLE(att) \
 	((att)->attstorage != 'p')
 
-static Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull);
-
 
 /* 
  *		misc support routines
@@ -82,7 +80,7 @@ static Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull);
 /*
  * Return the missing value of an attribute, or NULL if there isn't one.
  */
-static Datum
+Datum
 getmissingattr(TupleDesc tupleDesc,
 			   int attnum, bool *isnull)
 {
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 7b5896b98f9..0b245a613e0 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3412,10 +3412,7 @@ ltrmark:;
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 	}
 
-	if (HeapTupleHeaderGetNatts(tuple.t_data) < relation->rd_att->natts)
-		result = heap_expand_tuple(&tuple, relation->rd_att);
-	else
-		result = heap_copytuple(&tuple);
+	result = heap_copytuple(&tuple);
 	ReleaseBuffer(buffer);
 
 	return result;
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 873a3015fc4..d2b46456f1c 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -764,10 +764,7 @@ extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
 		((attnum) > 0) ? \
 		( \
 			((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
-			( \
-(*(isnull) = true), \
-(Datum)NULL \
-			) \
+getmissingattr(tupleDesc, attnum, isnull) \
 			: \
 fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
 		) \
@@ -788,6 +785,8 @@ extern Datum nocachegetattr(HeapTuple tup, int attnum,
 			   TupleDesc att);
 extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
 bool *isnull);
+extern Datum getmissingattr(TupleDesc tupleDesc,
+			   int attnum, bool *isnull);
 extern HeapTuple heap_copytuple(HeapTuple tuple);
 extern void heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest);
 extern Datum heap_copy_tuple_as_datum(HeapTuple tuple, TupleDesc tupleDesc);
diff --git a/src/test/regress/expected/fast_default.out b/src/test/regress/expected/fast_default.out
index 40a15bd2d61..10bc5ff757c 100644
--- a/src/test/regress/expected/fast_default.out
+++ b/src/test/regress/expected/fast_default.out
@@ -770,6 +770,33 @@ SELECT * FROM vtype2;
  2 | yyy
 (2 rows)
 
+-- Ensure that defaults are checked when evaluating whether HOT update
+-- is possible, this was broken for a while:
+-- https://postgr.es/m/20190202133521.ylauh3ckqa7colzj%40alap3.anarazel.de
+BEGIN;
+CREATE TABLE t();
+INSERT INTO t DEFAULT VALUES;
+ALTER TABLE t ADD COLUMN a int DEFAULT 1;
+CREATE INDEX ON t(a);
+-- set column with a default 1 to NULL, due to a bug that wasn't
+-- noticed has heap_getattr buggily returned NULL for default columns
+UPDATE t SET a = NULL;
+-- verify that index and non-index scans show the same result

Re: Commit Fest 2019-01 is now closed

2019-02-05 Thread Bruce Momjian
On Tue, Feb  5, 2019 at 11:55:37AM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Mon, Feb  4, 2019 at 06:34:50AM +, Tsunakawa, Takayuki wrote:
> >> Wow, thank you so much for your hard work.  The last CF for PG 12 should 
> >> be tough...
> 
> > Agreed.  I am somewhat concerned about this and am wondering what we can
> > do now to limit problems.
> 
> There's been talk periodically of having an aggressive triage effort
> to try to sort through the pending patches and decide which ones have
> no hope of making it to commit in the last CF.  Then, if we just push
> those off to the next cycle immediately and focus our attention on the
> ones that do have a credible chance, we might get more of the latter
> ones done.

The ones I am really worried about are the ones that keep getting
delayed, e.g., CTE inlining, online checksums, multi-variate statistics.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Commit Fest 2019-01 is now closed

2019-02-05 Thread Tom Lane
Bruce Momjian  writes:
> On Tue, Feb  5, 2019 at 11:55:37AM -0500, Tom Lane wrote:
>> There's been talk periodically of having an aggressive triage effort
>> to try to sort through the pending patches and decide which ones have
>> no hope of making it to commit in the last CF.  Then, if we just push
>> those off to the next cycle immediately and focus our attention on the
>> ones that do have a credible chance, we might get more of the latter
>> ones done.

> The ones I am really worried about are the ones that keep getting
> delayed, e.g., CTE inlining, online checksums, multi-variate statistics.

The only thing keeping me from committing CTE inlining today is doubt
about whether we have consensus on the syntax.  Those other two I don't
know the status of.

regards, tom lane



Re: Commit Fest 2019-01 is now closed

2019-02-05 Thread Bruce Momjian
On Tue, Feb  5, 2019 at 12:02:23PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Tue, Feb  5, 2019 at 11:55:37AM -0500, Tom Lane wrote:
> >> There's been talk periodically of having an aggressive triage effort
> >> to try to sort through the pending patches and decide which ones have
> >> no hope of making it to commit in the last CF.  Then, if we just push
> >> those off to the next cycle immediately and focus our attention on the
> >> ones that do have a credible chance, we might get more of the latter
> >> ones done.
> 
> > The ones I am really worried about are the ones that keep getting
> > delayed, e.g., CTE inlining, online checksums, multi-variate statistics.
> 
> The only thing keeping me from committing CTE inlining today is doubt
> about whether we have consensus on the syntax.  Those other two I don't
> know the status of.

I am thinking we should see which items we really want for PG 12 _now_
and allocate resources/help to get them done, rather than being
surprised they didn't make it.  I am glad we are in good shape with
CTEs, since that has been a long-requested feature.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Commit Fest 2019-01 is now closed

2019-02-05 Thread Andres Freund
Hi,

On 2019-02-05 11:55:37 -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Mon, Feb  4, 2019 at 06:34:50AM +, Tsunakawa, Takayuki wrote:
> >> Wow, thank you so much for your hard work.  The last CF for PG 12 should 
> >> be tough...
> 
> > Agreed.  I am somewhat concerned about this and am wondering what we can
> > do now to limit problems.
> 
> There's been talk periodically of having an aggressive triage effort
> to try to sort through the pending patches and decide which ones have
> no hope of making it to commit in the last CF.  Then, if we just push
> those off to the next cycle immediately and focus our attention on the
> ones that do have a credible chance, we might get more of the latter
> ones done.

I'm planning to do a pass like I did for v11's last CF before the start
of it. That still requires others to chime in, my opinion alone
shouldn't be the sole deciding factor...

What we'd talked about briefly at the Fosdem dev meeting was that a
field 'target release' or 'target branch' would be very useful to be
able to focus attention more. There's plenty stuff in the current CF
that is getting some attention, and deserves a bit more, but that
clearly isn't aimed for v12. Being able to filter by that would be huge.

Magnus has said he'd try creating something like that.

Greetings,

Andres Freund



Re: Commit Fest 2019-01 is now closed

2019-02-05 Thread Andres Freund
Hi,

On 2019-02-05 12:02:23 -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Tue, Feb  5, 2019 at 11:55:37AM -0500, Tom Lane wrote:
> >> There's been talk periodically of having an aggressive triage effort
> >> to try to sort through the pending patches and decide which ones have
> >> no hope of making it to commit in the last CF.  Then, if we just push
> >> those off to the next cycle immediately and focus our attention on the
> >> ones that do have a credible chance, we might get more of the latter
> >> ones done.
> 
> > The ones I am really worried about are the ones that keep getting
> > delayed, e.g., CTE inlining, online checksums, multi-variate statistics.
> 
> The only thing keeping me from committing CTE inlining today is doubt
> about whether we have consensus on the syntax.  Those other two I don't
> know the status of.

Online checksums has been punted by Magnus, because he didn't have time
to work on it so far. I'd provided him with a prototype of the necessary
infrastructure piece.

Greetings,

Andres Freund



Re: Protect syscache from bloating with negative cache entries

2019-02-05 Thread Tomas Vondra
Hi,

I find it a bit surprising there are almost no results demonstrating the
impact of the proposed changes on some typical workloads. It touches
code (syscache, ...) that is quite sensitive performance-wise, and
adding even just a little bit of overhead may hurt significantly. Even
on systems that don't have issues with cache bloat, etc.

I think this is something we need - benchmarks measuring the overhead on
a bunch of workloads (both typical and corner cases). Especially when
there was a limit on cache size in the past, and it was removed because
it was too expensive / hurting in some cases. I can't imagine committing
any such changes without this information.

This is particularly important as the patch was initially about one
particular issue (bloat due to negative entries), but then the scope
grew quite a bit. AFAICS the thread now talks about these workloads:

* negative entries (due to search_path lookups etc.)
* many tables accessed randomly
* many tables with only a small subset accessed frequently
* many tables with subsets accessed in subsets (due to pooling)
* ...

Unfortunately, some of those cases seem somewhat contradictory (i.e.
what works for one hurts the other), so I doubt it's possible to improve
all of them at once. But that makes the benchmarking even more important.
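
To give one concrete example, a negative-entry workload is trivial to
generate; a minimal sketch (my own, purely for benchmarking purposes) that
fills the catalog caches with negative entries by repeatedly looking up
relations that do not exist:

DO $$
BEGIN
    FOR i IN 1..100000 LOOP
        BEGIN
            EXECUTE format('SELECT 1 FROM missing_rel_%s', i);
        EXCEPTION WHEN undefined_table THEN
            NULL;  -- each failed name lookup can leave a negative cache entry
        END;
    END LOOP;
END $$;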


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Feature: temporary materialized views

2019-02-05 Thread Andreas Karlsson
On 2/5/19 12:36 PM, Michael Paquier wrote:> - skipData is visibly always 
false.

> We may want to keep skipData to have an assertion at the beginning of
> intorel_startup for sanity purposes though.
This is not true in this version of the patch. The following two cases 
would crash if we add such an assertion:


EXPLAIN ANALYZE CREATE TABLE foo AS SELECT 1 WITH NO DATA;

and

PREPARE s AS SELECT 1;
CREATE TABLE bar AS EXECUTE s WITH NO DATA;

since they both still run the setup and tear down steps of the executor.

I guess that I could fix that for the second case as soon as I 
understand how much of the portal stuff can be skipped in 
ExecuteQuery(). But I am not sure what we should do with EXPLAIN ANALYZE 
... NO DATA. It feels like a contradiction to me. Should we just raise an 
error? Or should we try to preserve the current behavior where you see 
something like the below?


QUERY PLAN
---
 Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.040 ms
 Execution Time: 0.002 ms
(3 rows)

> 2) DefineIntoRelForDestReceiver is just a wrapper for
> create_ctas_nodata, so we had better just merge both of them and
> expose directly the routine creating the relation definition, so the
> new interface is a bit awkward.
Agreed, the API is awkward as it is now, but it was the least awkward 
one I managed to design. But I think if we fix the issue above then it 
might be possible to create a less awkward API.


> 3) The part about the regression diff is well...  Expected...  We may
> want a comment about that.  We could consider as well adding a
> regression test inspired from REINDEX SCHEMA to show that the CTAS is
> created before the data is actually filled in.
Yeah, that sounds like a good idea.

Andreas



Re: Protect syscache from bloating with negative cache entries

2019-02-05 Thread Tomas Vondra
On 1/21/19 9:56 PM, Bruce Momjian wrote:
> On Fri, Jan 18, 2019 at 05:09:41PM -0800, Andres Freund wrote:
>> Hi,
>>
>> On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
>>> On Fri, Jan 18, 2019 at 4:23 PM and...@anarazel.de  
>>> wrote:
 My proposal for this was to attach a 'generation' to cache entries. Upon
 access cache entries are marked to be of the current
 generation. Whenever existing memory isn't sufficient for further cache
 entries and, on a less frequent schedule, triggered by a timer, the
 cache generation is increased and the new generation's "creation time" is
 measured.  Then generations that are older than a certain threshold are
 purged, and if there are any, the entries of the purged generation are
 removed from the caches using a sequential scan through the cache.

 This outline achieves:
 - no additional time measurements in hot code paths
 - no need for a sequential scan of the entire cache when no generations
   are too old
 - both size and time limits can be implemented reasonably cheaply
 - overhead when feature disabled should be close to zero
>>>
>>> Seems generally reasonable.  The "whenever existing memory isn't
>>> sufficient for further cache entries" part I'm not sure about.
>>> Couldn't that trigger very frequently and prevent necessary cache size
>>> growth?
>>
>> I'm thinking it'd just trigger a new generation, with its associated
>> "creation" time (which is cheap to acquire in comparison to creating a
>> number of cache entries). Depending on settings or just code policy we
>> can decide up to which generation to prune the cache, using that
>> creation time.  I'd imagine that we'd have some default cache-pruning
>> time in the minutes, and for workloads where relevant one can make
>> sizing configurations more aggressive - or something like that.
> 
> OK, so it seems everyone likes the idea of a timer.  The open questions
> are whether we want multiple epochs, and whether we want some kind of
> size trigger.
> 

FWIW I share the view that time-based eviction (be it some sort of
timestamp or epoch) seems promising; it seems cheaper than pretty much any
other LRU metric (requiring usage count / clock sweep / ...).

> With only one time epoch, if the timer is 10 minutes, you could expire an
> entry after 10-19 minutes, while with a new epoch every minute and
> 10-minute expire, you can do 10-11 minute precision.  I am not sure the
> complexity is worth it.
> 

I don't think having just a single epoch would be significantly less
complex than having more of them. In fact, having more of them might
make it actually cheaper.


> For a size trigger, should removal be effected by how many expired cache
> entries there are?  If there were 10k expired entries or 50, wouldn't
> you want them removed if they have not been accessed in X minutes?
> 
> In the worst case, if 10k entries were accessed in a query and never
> accessed again, what would the ideal cleanup behavior be?  Would it
> matter if it was expired in 10 or 19 minutes?  Would it matter if there
> were only 50 entries?
> 

I don't think we need to remove the expired entries right away, if there
are only very few of them. The cleanup requires walking the hash table,
which means significant fixed cost. So if there are only few expired
entries (say, less than 25% of the cache), we can just leave them around
and clean them if we happen to stumble on them (although that may not be
possible with dynahash, which has no concept of expiration) or before
enlarging the hash table.
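
To make that concrete, the kind of bookkeeping I have in mind looks roughly
like this (a sketch only, with made-up names -- cc_nexpired and
last_generation are counters/fields the patch would have to maintain, they
don't exist in catcache.c today):

#define PRUNE_MIN_FRACTION	0.25

static void
maybe_prune_catcache(CatCache *cache, uint32 current_generation, uint32 max_age)
{
	int			i;

	/* too few expired entries to be worth a full sweep of the buckets */
	if (cache->cc_nexpired < cache->cc_ntup * PRUNE_MIN_FRACTION)
		return;

	for (i = 0; i < cache->cc_nbuckets; i++)
	{
		dlist_mutable_iter iter;

		dlist_foreach_modify(iter, &cache->cc_bucket[i])
		{
			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

			/* evict unpinned entries whose generation is too old */
			if (ct->refcount == 0 &&
				current_generation - ct->last_generation > max_age)
			{
				CatCacheRemoveCTup(cache, ct);
				cache->cc_nexpired--;
			}
		}
	}
}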

FWIW when it comes to memory consumption, it's important to realize the
cache memory context won't release the memory to the system, even if we
remove the expired entries. It'll simply stash them into a freelist.
That's OK when the entries are to be reused, but the memory usage won't
decrease after a sudden spike for example (and there may be other chunks
allocated on the same page, so paging it out will hurt).

So if we want to address this case too (and we probably want), we may
need to discard the old cache memory context somehow (e.g. rebuild the
cache in a new one, and copy the non-expired entries). Which is a nice
opportunity to do the "full" cleanup, of course.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg11.1: dsa_area could not attach to segment

2019-02-05 Thread Justin Pryzby
And here's the "dsa_allocate could not find %zu free pages" error with core.

@@ -726,5 +728,5 @@ dsa_allocate_extended(dsa_area *area, size_t size, int flags)
 		 */
-		if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
-			elog(FATAL,
-				 "dsa_allocate could not find %zu free pages", npages);
+		assert (FreePageManagerGet(segment_map->fpm, npages, &first_page));
+
+		// if (!FreePageManagerGet(segment_map->fpm, npages, &first_page)) elog(PANIC, "dsa_allocate could not find %zu free pages", npages);
 		LWLockRelease(DSA_AREA_LOCK(area));

< 2019-02-05 13:23:29.137 EST  >LOG:  background worker "parallel worker" (PID 7140) was terminated by signal 6: Aborted
< 2019-02-05 13:23:29.137 EST  >DETAIL:  Failed process was running: explain analyze SELECT * FROM eric_enodeb_metrics WHERE start_time>='2017-10-01' AND (site_id<1900 OR site_id>2700)

#0  0x0037b9c32495 in raise () from /lib64/libc.so.6
#1  0x0037b9c33c75 in abort () from /lib64/libc.so.6
#2  0x0037b9c2b60e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0037b9c2b6d0 in __assert_fail () from /lib64/libc.so.6
#4  0x008c6f74 in dsa_allocate_extended (area=0x27c05e0, size=393220, 
flags=5) at dsa.c:729
#5  0x0068521f in pagetable_allocate (pagetable=, 
size=) at tidbitmap.c:1511
#6  0x006876d2 in pagetable_grow (tbm=0x7f84a8d86a58, pageno=1635) at 
../../../src/include/lib/simplehash.h:383
#7  pagetable_insert (tbm=0x7f84a8d86a58, pageno=1635) at 
../../../src/include/lib/simplehash.h:508
#8  tbm_get_pageentry (tbm=0x7f84a8d86a58, pageno=1635) at tidbitmap.c:1225
#9  0x00687c50 in tbm_add_tuples (tbm=0x7f84a8d86a58, tids=, ntids=1, recheck=false) at tidbitmap.c:405
#10 0x004e43df in btgetbitmap (scan=0x2829fa8, tbm=0x7f84a8d86a58) at 
nbtree.c:332
#11 0x004d8a91 in index_getbitmap (scan=0x2829fa8, bitmap=) at indexam.c:726
#12 0x00647c98 in MultiExecBitmapIndexScan (node=0x2829720) at 
nodeBitmapIndexscan.c:104
#13 0x00646078 in MultiExecBitmapOr (node=0x28046e8) at 
nodeBitmapOr.c:153
#14 0x00646afd in BitmapHeapNext (node=0x2828db8) at 
nodeBitmapHeapscan.c:145
#15 0x0063a550 in ExecScanFetch (node=0x2828db8, accessMtd=0x6469e0 
, recheckMtd=0x646740 ) at execScan.c:95
#16 ExecScan (node=0x2828db8, accessMtd=0x6469e0 , 
recheckMtd=0x646740 ) at execScan.c:162
#17 0x00638ce0 in ExecProcNodeInstr (node=0x2828db8) at 
execProcnode.c:461
#18 0x006414fc in ExecProcNode (pstate=) at 
../../../src/include/executor/executor.h:237
#19 ExecAppend (pstate=) at nodeAppend.c:294
#20 0x00638ce0 in ExecProcNodeInstr (node=0x27cb0a0) at 
execProcnode.c:461
#21 0x006349c7 in ExecProcNode (queryDesc=0x7f84a8de7520, 
direction=, count=0, execute_once=160) at 
../../../src/include/executor/executor.h:237
#22 ExecutePlan (queryDesc=0x7f84a8de7520, direction=, 
count=0, execute_once=160) at execMain.c:1723
#23 standard_ExecutorRun (queryDesc=0x7f84a8de7520, direction=, count=0, execute_once=160) at execMain.c:364
#24 0x7f84a97c8618 in pgss_ExecutorRun (queryDesc=0x7f84a8de7520, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
pg_stat_statements.c:892
#25 0x7f84a91aa7dd in explain_ExecutorRun (queryDesc=0x7f84a8de7520, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
auto_explain.c:268
#26 0x00635071 in ParallelQueryMain (seg=0x268fba8, toc=0x7f84a93ed000) 
at execParallel.c:1402
#27 0x00508f34 in ParallelWorkerMain (main_arg=) 
at parallel.c:1409
#28 0x00704760 in StartBackgroundWorker () at bgworker.c:834
#29 0x0070e11c in do_start_bgworker () at postmaster.c:5698
#30 maybe_start_bgworkers () at postmaster.c:5911
#31 0x00710786 in sigusr1_handler (postgres_signal_arg=) at postmaster.c:5091
#32 
#33 0x0037b9ce1603 in __select_nocancel () from /lib64/libc.so.6
#34 0x0071300e in ServerLoop (argc=, argv=) at postmaster.c:1670
#35 PostmasterMain (argc=, argv=) at 
postmaster.c:1379
#36 0x0067e8c0 in main (argc=3, argv=0x265f960) at main.c:228

#0  0x0037b9c32495 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x0037b9c33c75 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x0037b9c2b60e in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x0037b9c2b6d0 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x008c6f74 in dsa_allocate_extended (area=0x27c05e0, size=393220, 
flags=5) at dsa.c:729
npages = 97
first_page = 
span_pointer = 1099511632488
pool = 0x7f84a93eecf0
size_class = 
start_pointer = 
segment_map = 
result = 140207751879680
__func__ = "dsa_allocate_extended"
__PRETTY_FUNCTION__ = "dsa_allocate_extended"
#

Re: Commit Fest 2019-01 is now closed

2019-02-05 Thread Peter Geoghegan
On Tue, Feb 5, 2019 at 9:08 AM Andres Freund  wrote:
> What we'd talked about briefly at the Fosdem dev meeting was that a
> field 'target release' or 'target branch' would be very useful to be
> able to focus attention more. There's plenty stuff in the current CF
> that is getting some attention, and deserves a bit more, but that
> clearly isn't aimed for v12. Being able to filter by that would be huge.

That sounds like a good idea to me.

-- 
Peter Geoghegan



Re: Changing SQL Inlining Behaviour (or...?)

2019-02-05 Thread Paul Ramsey



> On Feb 3, 2019, at 3:47 PM, Tom Lane  wrote:
> 
> I wrote:
>> I've posted some preliminary design ideas at
>> https://www.postgresql.org/message-id/15193.1548028...@sss.pgh.pa.us
>> and
>> https://www.postgresql.org/message-id/15289.1548028...@sss.pgh.pa.us
>> While there's a nontrivial amount of work needed to make that happen,
>> I think it's doable, and it would lead to a significantly better
>> solution than proceeding along the inlining path could do.  My
>> current feeling, therefore, is that we should reject this patch
>> (or at least stick it in the deep freeze) and go work on that plan.
> 
> Now that the first of those threads has reached a feature-complete
> state, I feel fairly comfortable in saying that we should drop the
> idea of messing with the inlining heuristics (at least for the
> particular end goal stated in this thread).  So I'm going to go
> close this CF entry as returned-with-feedback.
> 
>   regards, tom lane

Hokay… I’ve read through the patch set, applied it and built it, all works. Am 
starting to try a test implementation in PostGIS land. Robert’s comment about 
“PostgreSQL magic” is ringing through my head ... Nodes and Ops and Exprs, oh 
my! What ever doesn’t kill me only makes me stronger, right? :)

P.


Re: patch to allow disable of WAL recycling

2019-02-05 Thread Jerry Jelinek
On Mon, Oct 1, 2018 at 7:16 PM Michael Paquier  wrote:

> On Thu, Sep 13, 2018 at 02:56:42PM -0600, Jerry Jelinek wrote:
> > I'll take a look at that. I had been trying to keep the patch as minimal
> as
> > possible, but I'm happy to work through this.
>
> (Please be careful with top-posting)
>
> Jerry, the last status was from three weeks ago with the patch waiting
> on the author, so I am marking it as returned with feedback.
> --
> Michael
>

I'd like to see if I can get this patch moving forward again. I have made
several changes to the patch since the last time this was discussed.

First, since last fall, we have found another performance problem related
to initializing WAL files. I've described this issue in more detail below,
but in order to handle this new problem, I decided to generalize the patch
so the tunable refers to running on a Copy-On-Write filesystem instead of
just being specific to WAL recycling. Specifically, I renamed the GUC
tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
more obvious what is being tuned and will also be more flexible if there
are other problems in the future which are related to running on a COW
filesystem. I'm happy to choose a different name for the tunable if people
don't like 'wal_cow_fs'.

Second, I've modified the WAL recycling code change as requested earlier.

Third, this new patch is rebased onto the current code base.

Finally, the patch now includes bypassing the zero-fill for new WAL files
when wal_cow_fs is true. Hopefully it should be obvious why this is
unnecessary for a COW filesystem, but here is some more information about
how this can cause a performance problem, at least on ZFS. As background,
internally ZFS will skip allocating zero-filled blocks, but that is handled
fairly late in the filesystem code flow. First, all of the thousands of
initialization 8k write system calls are being made by Postgres. ZFS will
throttle writes under some circumstances. We found that all of the writes
from XLogFileInit, while Postgres is also doing an autovacuum, will trigger
write throttling due to the large amount of write traffic induced by the
autovacuum. This problem occurs even when WAL files are being
recycled. That seems to be related to the fact that the checkpointer –
which is responsible for WAL file recycling – is taking so long that it is
falling behind its own estimate for WAL activity.
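
To show the shape of that last change, here is a simplified sketch (not the
attached patch itself; fd, zbuffer, nbytes and tmppath are as in xlog.c's
XLogFileInit(), and wal_cow_fs is the proposed GUC):

	/*
	 * On a COW filesystem, pre-zeroing the segment buys nothing: blocks are
	 * allocated on first real write anyway, and the extra ~16MB of writes
	 * can trip ZFS write throttling.  So only zero-fill when we are not on
	 * a COW filesystem.
	 */
	if (!wal_cow_fs)
	{
		memset(zbuffer.data, 0, XLOG_BLCKSZ);
		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
		{
			errno = 0;
			if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
			{
				/* if write didn't set errno, assume the problem is disk space */
				int			save_errno = errno ? errno : ENOSPC;

				close(fd);
				unlink(tmppath);
				errno = save_errno;
				ereport(ERROR,
						(errcode_for_file_access(),
						 errmsg("could not write to file \"%s\": %m", tmppath)));
			}
		}
	}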

The revised patch is attached. Please let me know if there are any comments.

Thanks,
Jerry


0001-cow-filesystem.patch
Description: Binary data


Re: Changing SQL Inlining Behaviour (or...?)

2019-02-05 Thread Tom Lane
Paul Ramsey  writes:
> Hokay… I’ve read through the patch set, applied it and built it, all works. 
> Am starting to try a test implementation in PostGIS land. Robert’s comment 
> about “PostgreSQL magic” is ringing through my head ... Nodes and Ops and 
> Exprs, oh my! What ever doesn’t kill me only makes me stronger, right? :)

Thanks for looking at it!  I think you are going to have some issues
with identifying the operators to use in PostGIS, since unlike the
core code you can't assume fixed OIDs for anything.  The other thread
I'd started has some ideas about fixing that, and I hope to get
something into v12 for it, but it's not done yet.

For testing purposes it might be enough to look up your index opfamilies
by name, though I'd not really recommend that as a production answer.

You're also kind of jumping the gun on the documentation ;-).  I need
to write a lot more in supportnodes.h about how to use the IndexCondition
callback, but the basic idea is to generate a list of OpExprs (at least
that'd be the normal case) that represent a directly-indexable condition.
It's entirely up to you to ensure that they *are* indexable conditions,
ie with an operator that belongs to the index's opfamily, index key on
the left, pseudoconstant value on the right.  The sample code that's
there now for LIKE/regex kind of punts on the last point: since it
can only do anything useful with a simple Const pattern, it doesn't
have to think hard about what's acceptable.  If you want to allow
non-simple-Const RHS then you need to reject volatile functions and
Vars of the indexed relation.  I was thinking of exposing a function
specifically to make that test, rather than requiring extensions to
copy the logic, but I didn't do it yet.
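
To be concrete, the test I have in mind is nothing more complicated than
this (a hypothetical helper; nothing with this name exists yet):

static bool
rhs_is_usable_comparison_value(Node *rhs, Index indexrel_relid)
{
	/* a volatile RHS can't be evaluated once and used as an index bound */
	if (contain_volatile_functions(rhs))
		return false;

	/* nor can an expression that references the indexed relation itself */
	if (bms_is_member(indexrel_relid, pull_varnos(rhs)))
		return false;

	return true;
}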

Anyway, happy to help if you have questions, just ask.

regards, tom lane



Re: Changing SQL Inlining Behaviour (or...?)

2019-02-05 Thread Paul Ramsey



> On Feb 5, 2019, at 11:07 AM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>> Hokay… I’ve read through the patch set, applied it and built it, all works. 
>> Am starting to try a test implementation in PostGIS land. Robert’s comment 
>> about “PostgreSQL magic” is ringing through my head ... Nodes and Ops and 
>> Exprs, oh my! What ever doesn’t kill me only makes me stronger, right? :)
> 
> Thanks for looking at it!  I think you are going to have some issues
> with identifying the operators to use in PostGIS, since unlike the
> core code you can't assume fixed OIDs for anything.  The other thread
> I'd started has some ideas about fixing that, and I hope to get
> something into v12 for it, but it's not done yet.
> 
> For testing purposes it might be enough to look up your index opfamilies
> by name, though I'd not really recommend that as a production answer.
> 
> You're also kind of jumping the gun on the documentation ;-).  I need
> to write a lot more in supportnodes.h about how to use the IndexCondition
> callback, but the basic idea is to generate a list of OpExprs (at least
> that'd be the normal case) that represent a directly-indexable condition.
> It's entirely up to you to ensure that they *are* indexable conditions,
> ie with an operator that belongs to the index's opfamily, index key on
> the left, pseudoconstant value on the right.  The sample code that's
> there now for LIKE/regex kind of punts on the last point: since it
> can only do anything useful with a simple Const pattern, it doesn't
> have to think hard about what's acceptable.  If you want to allow
> non-simple-Const RHS then you need to reject volatile functions and
> Vars of the indexed relation.  I was thinking of exposing a function
> specifically to make that test, rather than requiring extensions to
> copy the logic, but I didn't do it yet.



So just a meta-comment, this is all very cool and I can see how it will help 
out things like selectivity estimation and tuple return estimation for unnest() 
and generate_series() and even how I could eventually do some dynamic costing 
for some functions, but it’s getting very deep and far from our proximate 
problem which was just we were having trouble increasing the costs on our 
functions so we could get more frequent parallelization. 

We’re going to have a net complexity increase in our code base, because we’ll 
be maintaining this new approach alongside the old one for about 5 years until 
v12 becomes our oldest supported PostgreSQL.

And this paragraph above (excuse my ignorance) worries me that I might have a 
bunch of use patterns I now have to explicitly predict will happen and support 
that I might have gotten “for free” with the old dumbass inlining approach.

The only thing that will get more efficient for us, I think, will be 
ST_DWithin, which can pick the correct index op to use instead of supplying 
both of them. (Maybe? I think I’ll end up writing a bunch of logic which I 
previously got “for free” to (a) find which indexes are available? (b) if I 
have an index on *both* columns, check the selectivity of 'expand(b) && a' vs 
'b && expand(a)’ and (c) build up an appropriate new structure that 
incorporates the index *and* the expand() function call wrapper on the 
appropriate side.)



Anyways, once I have done an implementation it will probably appear less 
daunting.

P 


> 
> Anyway, happy to help if you have questions, just ask.
> 
>   regards, tom lane




Re: Changing SQL Inlining Behaviour (or...?)

2019-02-05 Thread Tom Lane
Paul Ramsey  writes:
> So just a meta-comment, this is all very cool and I can see how it will
> help out things like selectivity estimation and tuple return estimation
> for unnest() and generate_series() and even how I could eventually do
> some dynamic costing for some functions, but it’s getting very deep and
> far from our proximate problem which was just we were having trouble
> increasing the costs on our functions so we could get more frequent
> parallelization.

Sure, but you need to think bigger about what problem you're solving ;-).
There was no chance that we were going to back-patch that inlining rule
change hack, so none of this would provide you an opportunity to do better
in existing releases anyway.  With a bit more work --- and, honestly,
I think it should not end up being more than ~100 lines of C code
for you, else I've done something wrong --- we can move the goalposts
quite a long way in v12.

> The only thing that will get more efficient for us, I think, will be 
> ST_DWithin, which can pick the correct index op to use instead of supplying 
> both of them. (Maybe? I think I’ll end up writing a bunch of logic which I 
> previously got “for free” to (a) find which indexes are available? (b) if I 
> have an index on *both* columns, check the selectivity of 'expand(b) && a' vs 
> 'b && expand(a)’ and (c) build up an appropriate new structure that 
> incorporates the index *and* the expand() function call wrapper on the 
> appropriate side.

No, I don't think so.  The only aspect of that that the support function
is responsible for is verifying that the currently-looked-at index is
applicable, which boils down to checking that the index opfamily is
what you're expecting.  As I mentioned, the core code is just doing that
by testing for compiled-in OIDs, but extensions are going to have to
work a little harder because their opfamilies haven't got fixed OIDs.

Roughly speaking, what you've got to do is

* verify the index's opfamily OID (OK, some uncertainty
  remains about best way to do that)

* find out operator OID for && --- if nothing else,
  get_opfamily_member() will serve for that, once you've
  got the opfamily OID

* build index operator expression with make_opclause().
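
Squinted at in code form, that's not much more than this sketch (the
opfamily check and the strategy number are the parts you have to fill in;
my_expected_opfamily_oid and MY_OVERLAP_STRATEGY are placeholders):

static List *
build_overlap_index_condition(Oid opfamily, Expr *indexkey, Expr *otherarg)
{
	Oid			typid = exprType((Node *) indexkey);
	Oid			overlap_op;

	/* 1. is this an opfamily we know how to generate conditions for? */
	if (opfamily != my_expected_opfamily_oid())
		return NIL;

	/* 2. look up && for this type within that opfamily */
	overlap_op = get_opfamily_member(opfamily, typid, typid,
									 MY_OVERLAP_STRATEGY);
	if (!OidIsValid(overlap_op))
		return NIL;

	/* 3. index key on the left, pseudoconstant value on the right */
	return list_make1(make_opclause(overlap_op, BOOLOID, false,
									indexkey, otherarg,
									InvalidOid, InvalidOid));
}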

Figuring out the rest is the domain of the core code.  If there
are decisions to be made, it'd boil down to whether you've attached
good selectivity estimators to the index operators ... but if you
don't have that then I doubt you were getting good results before.

Anyway, as I said, I'm plenty happy to help.  I'd like to see how
this code turns out in PostGIS; if it's really all that complicated
then I need to spend some effort on adding infrastructure to make
it simpler.

regards, tom lane



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2019-02-05 Thread Alvaro Herrera
On 2019-Jan-29, Tom Lane wrote:

> The closest I could get to solving it along the original lines
> went like this:
> 
>  * In addition, we support INTERNAL_AUTO dependencies, which alter the
>  * rules for this.  If the target has both INTERNAL and INTERNAL_AUTO
>  * dependencies, then it can be dropped if any one of those objects is
>  * dropped, but not unless at least one of them is.  In this situation we
>  * mustn't automatically transform a random deletion request into a
>  * parent-object deletion.  Instead, we proceed with deletion if we are
>  * recursing from any owning object, and otherwise set the object aside to
>  * recheck at the end.  If we don't later find a reason to delete one of
>  * the owning objects, we'll throw an error.
> 
> I think that's ugly as sin: INTERNAL means something different if there's
> another dependency beside it?

I agree it's pretty weird :-(

> A theoretically purer answer is to split INTERNAL_AUTO into two new
> dependency types, one linking to the real "owning" object (the parent
> index or trigger) and the other to the secondary owner (the partition
> table), while leaving regular INTERNAL as it stands.  I don't especially
> want to go that way, though: it seems overcomplicated from the standpoint
> of outside inspections of pg_depend, and it'd be very risky to back-patch
> into v11.  (For instance, if someone went backwards on their postmaster
> minor release, the older v11 version would not know what to do with the
> new dependency type.)

I don't put a lot of value in the idea of people going back in minor
releases, but I agree with the general principle of not making pg_depend
any harder to inspect than it already is.  Even so, I wouldn't dismiss a
solution along these lines, if it was clean enough; we could provide
better tools for pg_depend inspection (some sensible view, at least).
It might make sense to implement this in master only, if it turns out to
be better in some way.  I'm not volunteering right now, but maybe in the
future.

> It strikes me however that we can stick with the existing catalog contents
> by making this definition: of the INTERNAL_AUTO dependencies of a given
> object, the "true owner" to be reported in errors is the dependency
> that is of the same classId as that object.  If this doesn't uniquely
> identify one dependency, the report will mention a random one.  This is
> ugly as well, but it's certainly a lot more practical to cram into v11,
> and it seems like it would cover the current and likely future use-cases
> for partition-related objects.  When and if we find a use-case for which
> this doesn't work, we can think about extending the catalog representation
> to identify a unique owner in a less ad-hoc way.

This seems a great idea.  Do you want me to implement it?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Too rigorous assert in reorderbuffer.c

2019-02-05 Thread Alvaro Herrera
On 2019-Jan-31, Arseny Sher wrote:

> My colleague Alexander Lakhin has noticed an assertion failure in
> reorderbuffer.c:1330. Here is a simple snippet reproducing it:
> 
> SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 
> 'test_decoding');
> 
> create table t(k int);
> begin;
> savepoint a;
> alter table t alter column k type text;
> rollback to savepoint a;
> alter table t alter column k type bigint;
> commit;
> 
> SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 
> 'include-xids', '0', 'skip-empty-xacts', '1');

Hmm, the new test introduced by your patch fails in early branches (at
least 9.4): the transaction is decoded like this:

! data 
! -
   BEGIN
   table public.tr_sub_ddl: INSERT: data[integer]:42
+  table public.pg_temp_16445: INSERT: data[bigint]:42
   table public.tr_sub_ddl: INSERT: data[bigint]:43
   COMMIT
! (5 rows)

note the additional pg_temp_XYZ row in the middle.  This is caused by
the rewrite in ALTER TABLE.  Peter E fixed that in Pg11 in commit
325f2ec55; I don't think there's much to do in the backbranches other
than hide the pesky record to avoid it breaking the test.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Macros bundling RELKIND_* conditions

2019-02-05 Thread Alvaro Herrera
On 2018-Dec-19, Tom Lane wrote:

> Alvaro Herrera  writes:

> > I support this idea.  Here's a proof-of-concept patch that corresponds
> > to one of the cases that Ashutosh was on about (specifically, the one
> > that uses the RELKIND_CAN_HAVE_STORAGE macro I just added).  If there
> > are no objections to this approach, I'm going to complete it along these
> > lines.
> 
> +1

Here's a v2 -- completed patching the other places that needed it, fixed
test expected output, and made the other changes you suggested.

It's a bit scary that none of the backend message wording changes affect
any tests.  Only contrib test files had to be patched.  (I didn't run
sepgsql tests.)

One (very small IMO) problem here is that Ashutosh started from the
desire of being able to notice when a new relkind needs to be added to
checks spread in the code.  This patch doesn't really do that; the new
"errdetail_unsuitable_relkind" can be grepped for but that doesn't
really solve it because some places don't throw errors, but just, say,
"continue".  I don't see a solution for this.

But I do see one further problem, which is that each message being
patched was listing the relkinds that are allowed and now they ain't.
It seems user-friendly to preserve that list in some way.  An example:

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index bd85099c28..7a255176df 100644
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
*************** DefineIndex(Oid relationId,
*** 437,444 ****
  			default:
  				ereport(ERROR,
  						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 						 errmsg("\"%s\" is not a table or materialized view",
! 								RelationGetRelationName(rel))));
  				break;
  		}
  
--- 437,447 ----
  			default:
  				ereport(ERROR,
  						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 						 errmsg("cannot create index on \"%s\"",
! 								RelationGetRelationName(rel)),
! 						 errdetail_unsuitable_relkind(rel->rd_rel->relkind,
! 													  RelationGetRelationName(rel)),
! 						 errhint("Indexes can only be created on regular tables, materialized views and partitioned tables.")));
  				break;
  		}
  

This requires inventing a new message for at least some of the patched
sites.  I'm not happy about using errhint() for that, since that's not a
hint, but I don't see anything else we could use.

(I'm not sure about using the term "regular table" for RELKIND_RELATION,
but I think using unadorned "table" doesn't cut it anymore -- we have
TOAST tables, foreign tables, and partitioned tables, none of which 
are tables for many of these checks.  If you look at DefineIndex you'll
see that we even have a separate message for foreign tables there, which
kinda proves my point.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 5cfb74bfa200fb34b1ca45546c4285d1886caab9 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 19 Dec 2018 14:33:07 -0300
Subject: [PATCH v2] unsuitable relkind

---
 .../pg_visibility/expected/pg_visibility.out  | 75 ---
 contrib/pg_visibility/pg_visibility.c |  7 +-
 contrib/pgstattuple/expected/pgstattuple.out  |  9 ++-
 contrib/pgstattuple/pgstatindex.c | 13 ++--
 src/backend/catalog/catalog.c | 45 +++
 src/backend/catalog/toasting.c|  6 +-
 src/backend/commands/comment.c|  7 +-
 src/backend/commands/seclabel.c   |  6 +-
 src/include/catalog/catalog.h |  2 +
 9 files changed, 127 insertions(+), 43 deletions(-)

diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index f0dcb897c4..d72ea2f63e 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -6,66 +6,91 @@ CREATE EXTENSION pg_visibility;
 create table test_partitioned (a int) partition by list (a);
 -- these should all fail
 select pg_visibility('test_partitioned', 0);
-ERROR:  "test_partitioned" is not a table, materialized view, or TOAST table
+ERROR:  cannot examine visibility information for "test_partitioned"
+DETAIL:  "test_partitioned" is a partitioned table.
 select pg_visibility_map('test_partitioned');
-ERROR:  "test_partitioned" is not a table, materialized view, or TOAST table
+ERROR:  cannot examine visibility information for "test_partitioned"
+DETAIL:  "test_partitioned" is a partitioned table.
 select pg_visibility_map_summa

Bogus lateral-reference-propagation logic in create_lateral_join_info

2019-02-05 Thread Tom Lane
While poking at bug #15613 (in which FDWs are failing to mark created
Paths with correct outer-reference sets), I thought it'd be a good idea
to add some asserts to Path creation saying that every Path should be
parameterized by at least whatever the relation's required LATERAL
references are.  I did that as per the added assertions in relnode.c
below.  I didn't expect any failures in the existing regression tests,
since we know those don't exercise bug #15613, but darn if the addition
to get_appendrel_parampathinfo didn't blow up in the core tests.

A bit of excavation later, it turns out that this is a bug since day 1
in create_lateral_join_info.  It needs to propagate lateral_relids
and related fields from appendrel parents to children, but it was not,
in the original code, accounting for the possibility of grandchildren.
Those need to get the lateral fields propagated from their topmost
ancestor too, but they weren't.  This leads to having an intermediate
appendrel that is marked with some lateral_relids when its children
are not, causing add_paths_to_append_rel to believe that unparameterized
paths can be built, triggering the new assertion.  I'm not sure if there
are any worse consequences; the regression test case that's triggering
the Assert seems to work otherwise, so we might be accidentally failing
to fail.  But it's not supposed to be like that.

Commit 0a480502b hacked this code up to deal with grandchildren for
the case of partitioned tables, but the regression test that's falling
over involves nested UNION ALL subqueries, which are not that.  Rather
than add another RTE-kind exception, though, I think we ought to rewrite
it completely and get rid of the nested loops in favor of one traversal
of the append_rel_list.  This does require assuming that the
append_rel_list has ancestor entries before descendant entries, but
that's okay because of the way the list is built.  (I note that 0a480502b
is effectively assuming that ancestors appear before children in the RTE
list, which is no safer an assumption.)

Also, I'd really like to know why I had to put in the exception seen
in the loop for AppendRelInfos that do not point to a valid parent.
It seems to me that that is almost certainly working around a bug in
the partitioning logic.  (Without it, the partition_prune regression test
crashes.)  Or would somebody like to own up to having created that state
of affairs intentionally?  If so why?

Anyway, I propose to commit and back-patch the initsplan.c part of the
attached.  The added asserts in relnode.c should probably not go in
until we have a bug #15613 fix that will prevent postgres_fdw from
triggering them, so I'll do that later.

regards, tom lane

diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index d6ffa78..22a6da3 100644
*** a/src/backend/optimizer/plan/initsplan.c
--- b/src/backend/optimizer/plan/initsplan.c
*** void
*** 419,424 
--- 419,425 
  create_lateral_join_info(PlannerInfo *root)
  {
  	bool		found_laterals = false;
+ 	Relids		prev_parents PG_USED_FOR_ASSERTS_ONLY = NULL;
  	Index		rti;
  	ListCell   *lc;
  
*** create_lateral_join_info(PlannerInfo *ro
*** 627,679 
  	 * every child anyway, and there's no value in forcing extra
  	 * reparameterize_path() calls.  Similarly, a lateral reference to the
  	 * parent prevents use of otherwise-movable join rels for each child.
  	 */
! 	for (rti = 1; rti < root->simple_rel_array_size; rti++)
  	{
! 		RelOptInfo *brel = root->simple_rel_array[rti];
! 		RangeTblEntry *brte = root->simple_rte_array[rti];
! 
! 		/*
! 		 * Skip empty slots. Also skip non-simple relations i.e. dead
! 		 * relations.
! 		 */
! 		if (brel == NULL || !IS_SIMPLE_REL(brel))
! 			continue;
  
  		/*
! 		 * In the case of table inheritance, the parent RTE is directly linked
! 		 * to every child table via an AppendRelInfo.  In the case of table
! 		 * partitioning, the inheritance hierarchy is expanded one level at a
! 		 * time rather than flattened.  Therefore, an other member rel that is
! 		 * a partitioned table may have children of its own, and must
! 		 * therefore be marked with the appropriate lateral info so that those
! 		 * children eventually get marked also.
  		 */
! 		Assert(brte);
! 		if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
! 			(brte->rtekind != RTE_RELATION ||
! 			 brte->relkind != RELKIND_PARTITIONED_TABLE))
  			continue;
  
! 		if (brte->inh)
! 		{
! 			foreach(lc, root->append_rel_list)
! 			{
! AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);
! RelOptInfo *childrel;
  
! if (appinfo->parent_relid != rti)
! 	continue;
! childrel = root->simple_rel_array[appinfo->child_relid];
! Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
! Assert(childrel->direct_lateral_relids == NULL);
! childrel->direct_lateral_relids = brel->direct_lateral_relids;
! Assert(childrel->lateral_relids == N

Re: Protect syscache from bloating with negative cache entries

2019-02-05 Thread Alvaro Herrera
On 2019-Feb-05, Tomas Vondra wrote:

> I don't think we need to remove the expired entries right away, if there
> are only very few of them. The cleanup requires walking the hash table,
> which means significant fixed cost. So if there are only few expired
> entries (say, less than 25% of the cache), we can just leave them around
> and clean them if we happen to stumble on them (although that may not be
> possible with dynahash, which has no concept of expiration) of before
> enlarging the hash table.

I think seqscanning the hash table is going to be too slow; Ideriha-san's
idea of having a dlist with the entries in LRU order (where each entry
is moved to head of list when it is touched) seemed good: it allows you
to evict older ones when the time comes, without having to scan the rest
of the entries.  Having a dlist means two more pointers on each cache
entry AFAIR, so it's not a huge amount of memory.
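
Roughly like this, just to show how little bookkeeping it is (names made
up, and the proper entry teardown is elided):

typedef struct CacheEntryLRU
{
	dlist_node	lru_node;		/* the two extra pointers per entry */
	/* ... rest of the cache entry ... */
} CacheEntryLRU;

static dlist_head cache_lru = DLIST_STATIC_INIT(cache_lru);

/* call on every lookup that touches an entry */
static void
cache_entry_touched(CacheEntryLRU *entry)
{
	dlist_delete(&entry->lru_node);
	dlist_push_head(&cache_lru, &entry->lru_node);
}

/* evict from the tail until we're back under the limit */
static void
cache_evict(int *nentries, int max_entries)
{
	while (*nentries > max_entries && !dlist_is_empty(&cache_lru))
	{
		CacheEntryLRU *victim = dlist_container(CacheEntryLRU, lru_node,
												dlist_tail_node(&cache_lru));

		dlist_delete(&victim->lru_node);
		pfree(victim);			/* real code would release the entry properly */
		(*nentries)--;
	}
}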

> So if we want to address this case too (and we probably want), we may
> need to discard the old cache memory context someho (e.g. rebuild the
> cache in a new one, and copy the non-expired entries). Which is a nice
> opportunity to do the "full" cleanup, of course.

Yeah, we probably don't want to do this super frequently though.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2019-02-05 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Jan-29, Tom Lane wrote:
>> It strikes me however that we can stick with the existing catalog contents
>> by making this definition: of the INTERNAL_AUTO dependencies of a given
>> object, the "true owner" to be reported in errors is the dependency
>> that is of the same classId as that object.  If this doesn't uniquely
>> identify one dependency, the report will mention a random one.  This is
>> ugly as well, but it's certainly a lot more practical to cram into v11,
>> and it seems like it would cover the current and likely future use-cases
>> for partition-related objects.  When and if we find a use-case for which
>> this doesn't work, we can think about extending the catalog representation
>> to identify a unique owner in a less ad-hoc way.

> This seems a great idea.  Do you want me to implement it?

I've got much of the code for it already (in the wreckage of my failed
attempts), so I'll go back and finish that up.  I was just waiting to see
how loudly people would howl about using object type as a condition for
figuring out what a pg_depend entry really means.  If we're okay with
that hack, I think I can make it work.

regards, tom lane



Re: Protect syscache from bloating with negative cache entries

2019-02-05 Thread Tomas Vondra
On 2/5/19 11:05 PM, Alvaro Herrera wrote:
> On 2019-Feb-05, Tomas Vondra wrote:
> 
>> I don't think we need to remove the expired entries right away, if there
>> are only very few of them. The cleanup requires walking the hash table,
>> which means significant fixed cost. So if there are only few expired
>> entries (say, less than 25% of the cache), we can just leave them around
>> and clean them if we happen to stumble on them (although that may not be
>> possible with dynahash, which has no concept of expiration) of before
>> enlarging the hash table.
> 
> I think seqscanning the hash table is going to be too slow; Ideriha-san
> idea of having a dlist with the entries in LRU order (where each entry
> is moved to head of list when it is touched) seemed good: it allows you
> to evict older ones when the time comes, without having to scan the rest
> of the entries.  Having a dlist means two more pointers on each cache
> entry AFAIR, so it's not a huge amount of memory.
> 

Possibly, although my guess is it will depend on the number of entries
to remove. For small number of entries, the dlist approach is going to
be faster, but at some point the bulk seqscan gets more efficient.

FWIW this is exactly where a bit of benchmarking would help.

>> So if we want to address this case too (and we probably want), we may
>> need to discard the old cache memory context someho (e.g. rebuild the
>> cache in a new one, and copy the non-expired entries). Which is a nice
>> opportunity to do the "full" cleanup, of course.
> 
> Yeah, we probably don't want to do this super frequently though.
> 

Right. I've also realized the resizing is built into dynahash and is
kinda incremental - we add (and split) buckets one by one, instead of
immediately rebuilding the whole hash table. So yes, this would need
more care and might need to interact with dynahash in some way.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2019-02-05 Thread Peter Geoghegan
On Tue, Feb 5, 2019 at 2:08 PM Tom Lane  wrote:
> I've got much of the code for it already (in the wreckage of my failed
> attempts), so I'll go back and finish that up.  I was just waiting to see
> how loudly people would howl about using object type as a condition for
> figuring out what a pg_depend entry really means.  If we're okay with
> that hack, I think I can make it work.

Perhaps I've missed some subtlety, but I'm not sure that it's all that
ugly. If splitting INTERNAL_AUTO into two new dependency types amounts
to the same thing as what you suggest here, then what's the
difference? If this secondary INTERNAL_AUTO entry property can be
determined from the pg_depend record alone with either approach, then
it's not obvious to me that an "explicit representation" buys us
anything. Yes, you must introduce a special case...but isn't it a
special case either way?

-- 
Peter Geoghegan



Re: Internal error while setting reloption on system catalogs.

2019-02-05 Thread Kyotaro HORIGUCHI
At Tue, 05 Feb 2019 10:01:48 -0500, Tom Lane  wrote in 
<20605.1549378...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> > The following command complains with an internal
> > error. (allow_system_table_mods is on).
> 
> > alter table pg_attribute set (fillfactor = 90);
> >> ERROR:  AccessExclusiveLock required to add toast table.
> 
> > The same happens for pg_class.
> 
> Isn't this more or less the same thing Peter E. was attempting to fix in
> https://commitfest.postgresql.org/22/1764/
> ?

Uggh, right. One of the problems is the same as this one. Thank you
for the pointer, Tom.

> > The first attached is for master and 11.
> > The second is for 10.
> > The third is for 9.6.
> 
> It's not clear to me that we ought to consider catalog alterations
> supported at all, but in any case I wouldn't treat it as something
> to back-patch.

I'm OK with that. I noticed this while considering the use of reloptions
on system catalogs.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Libpq support to connect to standby server as priority

2019-02-05 Thread Haribabu Kommi
On Mon, Jan 21, 2019 at 5:48 PM Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> > Thanks for finding out the problem, how about the following way of
> checking
> > for prefer-read/prefer-standby.
> >
> > 1. (default_transaction_read_only = true and recovery = true)
> >
> > 2. If none of the host satisfies the above scenario, then recovery = true
> > 3. Last check is for default_transaction_read_only = true
>
> That would be fine.  But as I mentioned in another mail, I think "get
> read-only session" and "connect to standby" differ.  So I find it better to
> separate parameters for those request; target_session_attr and
> target_server_type.
>

Thanks for your opinion.

So target_session_attrs would check default_transaction_read_only, and
target_server_type would check whether the server is in recovery?

I feel that having two options makes this feature complex to use from the
user's point of view.

The need for two options arose because of the possibility of a master server
with default_transaction_read_only set to true. But even if the default
transaction mode is read-only, it is a user-changeable parameter, so there
shouldn't be any problem.

The same applies to logical replication: the user can change the default
transaction mode once the connection is established, if it doesn't match
the application's requirements.

How about just adding one parameter that takes options similar to JDBC's?
target_server_type - master, standby and prefer-standby. (The option names
can be revised based on the common wording in the PostgreSQL docs.)

And one more thing: what happens when the server is promoted to master but
the connection requested a standby? I feel we can keep the existing
connections, and new connections can later be redirected. Comments?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-02-05 Thread Peter Geoghegan
On Mon, Jan 28, 2019 at 1:41 PM Peter Geoghegan  wrote:
> Thanks for going to the trouble of implementing what you have in mind!
>
> I agree that the code that I wrote within nbtsplitloc.c is very
> subtle, and I do think that I have further work to do to make its
> design clearer. I think that this high level description of the goals
> of the algorithm is inaccurate in subtle but important ways, though.
> Hopefully there will be a way of making it more understandable that
> preserves certain important characteristics.

Heikki and I had the opportunity to talk about this recently. We found
an easy way forward. I believe that the nbtsplitloc.c algorithm itself
is fine -- the code will need to be refactored, though.

nbtsplitloc.c can be refactored to assemble a list of legal split
points up front, before deciding which one to go with in a separate
pass (using one of two "alternative modes", as before). I now
understand that Heikki simply wants to separate the questions of "Is
this candidate split point legal?" from "Is this known-legal candidate
split point good/ideal based on my current criteria?". This seems like
a worthwhile goal to me. Heikki accepts the need for multiple
modes/passes, provided recursion isn't used in the implementation.

It's clear to me that the algorithm should start off trying to split
towards the middle of the page (or towards the end in the rightmost
case), while possibly making a small compromise on the exact split
point to maximize the effectiveness of suffix truncation. We must
change strategy entirely if and only if the middle of the page (or
wherever we'd like to split initially) is found to be completely full
of duplicates -- that's where the need for a second pass comes in.
This should almost never happen in most applications. Even when it
happens, we only care about not splitting inside a group of
duplicates. That's not the same thing as caring about maximizing the
number of attributes truncated away. Those two things seem similar,
but are actually very different.
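
The refactored control flow we're talking about is roughly this (all of
these names are invented for illustration; none of them exist in the
posted patch):

static OffsetNumber
choose_split_point(Relation rel, Page page, IndexTuple newitem)
{
	List	   *candidates;
	SplitPoint *best;

	/* first pass: collect every legal split point, ignoring quality */
	candidates = assemble_legal_split_points(rel, page, newitem);

	/* second pass: pick the best candidate under the default criteria */
	best = pick_best_split_point(candidates, SPLIT_DEFAULT_MODE);

	/*
	 * Only when the area we wanted to split turns out to be wall-to-wall
	 * duplicates do we redo the second pass with the alternative criteria.
	 * No recursion -- just another scan of the already-assembled list.
	 */
	if (split_interval_is_all_duplicates(candidates, best, page))
		best = pick_best_split_point(candidates, SPLIT_MANY_DUPLICATES_MODE);

	return best->firstoldonright;	/* offset of first tuple to go right */
}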

It might have sounded like Heikki and I disagreed on the design of the
algorithm at a high level, or what its goals ought to be. That is not
the case, though. (At least not so far.)

-- 
Peter Geoghegan



Re: Bogus lateral-reference-propagation logic in create_lateral_join_info

2019-02-05 Thread David Rowley
On Wed, 6 Feb 2019 at 10:57, Tom Lane  wrote:
> Also, I'd really like to know why I had to put in the exception seen
> in the loop for AppendRelInfos that do not point to a valid parent.
> It seems to me that that is almost certainly working around a bug in
> the partitioning logic.  (Without it, the partition_prune regression test
> crashes.)  Or would somebody like to own up to having created that state
> of affairs intentionally?  If so why?

Sounds pretty strange to me.   What exactly do you mean by not
pointing to a valid parent? Do you mean the parent_relid index is not
a valid RelOptInfo?

Can you point to the regression test that's doing this?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



RE: speeding up planning with partitions

2019-02-05 Thread Tsunakawa, Takayuki
Imai-san,

From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> On 2019/01/22 18:47, David Rowley wrote:
> > On Tue, 22 Jan 2019 at 20:01, Imai, Yoshikazu
> >> What I understand so far is about 10,000 while loops at total
> (4098+4098+some extra) is needed in hash_seq_search() in EXECUTE query
> after the creation of the generic plan.
> >> 10,000 while loops takes about 10 microsec (of course, we can't estimate
> correct time), and the difference of the latency between 5th and 7th EXECUTE
> is about 8 microsec, I currently think this causes the difference.
> >  >
> >> I don't know this problem relates to Amit-san's patch, but I'll continue
> to investigate it.
> >
> > I had another thought... when you're making a custom plan you're only
> > grabbing locks on partitions that were not pruned (just 1 partition in
> > your case), but when making the generic plan, locks will be acquired
> > on all partitions (all 4000 of them). This likely means that when
> > building the generic plan for the first time that the
> > LockMethodLocalHash table is expanded to fit all those locks, and
> > since we never shrink those down again, it'll remain that size for the
> > rest of your run.  I imagine the reason for the slowdown is that
> > during LockReleaseAll(), a sequential scan is performed over the
> > entire hash table. I see from looking at the hash_seq_search() code
> > that the value of max_bucket is pretty critical to how it'll perform.
> > The while ((curElem = segp[segment_ndx]) == NULL) loop will need to
> > run fewer times with a lower max_bucket.
> 
> I too think that that might be behind that slight drop in performance.
> So, it's good to know what one of the performance bottlenecks is when
> dealing with large number of relations in queries.

Can you compare the performance of auto and force_custom_plan again with the 
attached patch?  It uses PGPROC's LOCALLOCK list instead of the hash table.
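
The rough idea is the following sketch (illustration only, not the attached
patch; proc_node is a dlist_node the patch would add to LOCALLOCK, and
whether the list hangs off PGPROC or is simply backend-local is a detail):

static dlist_head held_locallocks = DLIST_STATIC_INIT(held_locallocks);

/* LockAcquireExtended() remembers each LOCALLOCK it creates */
static void
remember_locallock(LOCALLOCK *locallock)
{
	dlist_push_tail(&held_locallocks, &locallock->proc_node);
}

/* LockReleaseAll() then walks only the locks actually held ... */
static void
release_held_locallocks(void)
{
	dlist_mutable_iter iter;

	dlist_foreach_modify(iter, &held_locallocks)
	{
		LOCALLOCK  *locallock = dlist_container(LOCALLOCK, proc_node, iter.cur);

		dlist_delete(&locallock->proc_node);
		/* existing per-LOCALLOCK release logic goes here */
	}
}

/* ... instead of hash_seq_search() over the whole LockMethodLocalHash,
 * whose max_bucket stays large once a generic plan has touched all the
 * partitions. */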


Regards
Takayuki Tsunakawa





faster-locallock-scan.patch
Description: faster-locallock-scan.patch


Re: pg11.1: dsa_area could not attach to segment

2019-02-05 Thread Justin Pryzby
I should have included the query plan for the query which caused the "could not
find free pages" error.

This is a contrived query which I made up to try to exercise/stress bitmap
scans based on Thomas's working hypothesis for this error/bug.  This seems to
be easier to hit than the other error ("could not attach to segment") - a loop
around this query has run into "free pages" several times today.

explain (analyze,costs off,timing off) SELECT * FROM eric_enodeb_metrics WHERE 
start_time>='2017-10-01' AND (site_id<1900 OR site_id>2700)

 Gather (actual rows=82257 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   ->  Parallel Append (actual rows=20564 loops=4)
 ->  Parallel Bitmap Heap Scan on eric_enodeb_201901 (actual rows=6366 
loops=4)
   Recheck Cond: ((site_id < 1900) OR (site_id > 2700))
   Filter: (start_time >= '2017-10-01 00:00:00-04'::timestamp with 
time zone)
   Heap Blocks: exact=2549
   ->  BitmapOr (actual rows=0 loops=1)
 ->  Bitmap Index Scan on eric_enodeb_201901_site_idx 
(actual rows=0 loops=1)
   Index Cond: (site_id < 1900)
 ->  Bitmap Index Scan on eric_enodeb_201901_site_idx 
(actual rows=25463 loops=1)
   Index Cond: (site_id > 2700)
 ->  Parallel Bitmap Heap Scan on eric_enodeb_201810 (actual rows=15402 
loops=1)
   Recheck Cond: ((site_id < 1900) OR (site_id > 2700))
   Filter: (start_time >= '2017-10-01 00:00:00-04'::timestamp with 
time zone)
   ->  BitmapOr (actual rows=0 loops=1)
 ->  Bitmap Index Scan on eric_enodeb_201810_site_idx 
(actual rows=0 loops=1)
   Index Cond: (site_id < 1900)
 ->  Bitmap Index Scan on eric_enodeb_201810_site_idx 
(actual rows=15402 loops=1)
   Index Cond: (site_id > 2700)
 ->  Parallel Bitmap Heap Scan on eric_enodeb_201812 (actual rows=14866 
loops=1)
   Recheck Cond: ((site_id < 1900) OR (site_id > 2700))
   Filter: (start_time >= '2017-10-01 00:00:00-04'::timestamp with 
time zone)
   ->  BitmapOr (actual rows=0 loops=1)
 ->  Bitmap Index Scan on eric_enodeb_201812_site_idx 
(actual rows=0 loops=1)
   Index Cond: (site_id < 1900)
 ->  Bitmap Index Scan on eric_enodeb_201812_site_idx 
(actual rows=14866 loops=1)
   Index Cond: (site_id > 2700)
 ->  Parallel Bitmap Heap Scan on eric_enodeb_201811 (actual rows=7204 
loops=2)
   Recheck Cond: ((site_id < 1900) OR (site_id > 2700))
   Filter: (start_time >= '2017-10-01 00:00:00-04'::timestamp with 
time zone)
   Heap Blocks: exact=7372
   ->  BitmapOr (actual rows=0 loops=1)
 ->  Bitmap Index Scan on eric_enodeb_201811_site_idx 
(actual rows=0 loops=1)
   Index Cond: (site_id < 1900)
 ->  Bitmap Index Scan on eric_enodeb_201811_site_idx 
(actual rows=14408 loops=1)
   Index Cond: (site_id > 2700)
 ->  Parallel Bitmap Heap Scan on eric_enodeb_201902 (actual rows=5128 
loops=1)
   Recheck Cond: ((site_id < 1900) OR (site_id > 2700))
   Filter: (start_time >= '2017-10-01 00:00:00-04'::timestamp with 
time zone)
   Heap Blocks: exact=3374
   ->  BitmapOr (actual rows=0 loops=1)
 ->  Bitmap Index Scan on eric_enodeb_201902_site_idx 
(actual rows=0 loops=1)
   Index Cond: (site_id < 1900)
 ->  Bitmap Index Scan on eric_enodeb_201902_site_idx 
(actual rows=5128 loops=1)
   Index Cond: (site_id > 2700)
 [...]

Justin



Re: Feature: temporary materialized views

2019-02-05 Thread Andreas Karlsson

On 2/5/19 6:56 PM, Andreas Karlsson wrote:
On 2/5/19 12:36 PM, Michael Paquier wrote:
> - skipData is visibly always false.

 > We may want to keep skipData to have an assertion at the beginning of
 > intorel_startup for sanity purposes though.
This is not true in this version of the patch. The following two cases 
would crash if we add such an assertion:


EXPLAIN ANALYZE CREATE TABLE foo AS SELECT 1 WITH NO DATA;

and

PREPARE s AS SELECT 1;
CREATE TABLE bar AS EXECUTE s WITH NO DATA;

since they both still run the setup and tear down steps of the executor.

I guess that I could fix that for the second case as soon as I 
understand how much of the portal stuff can be skipped in 
ExecuteQuery(). But I am not sure what we should do with EXPLAIN ANALYZE 
... NO DATA. It feels like a contradiction to me. Should we just raise an 
error? Or should we try to preserve the current behavior where you see 
something like the below?


In general I do not like how EXPLAIN CREATE TABLE AS uses a separate 
code path from CREATE TABLE AS, because we get weird but mostly harmless 
edge cases like the one below, and because I do not think that EXPLAIN 
ANALYZE CREATE MATERIALIZED VIEW sets the security context properly.


I am not sure if any of this is worth fixing, but it certainly 
contributed to why I thought that it was hard to design a good API.


postgres=# EXPLAIN ANALYZE CREATE TABLE IF NOT EXISTS bar AS SELECT 1;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.002 rows=1 loops=1)
 Planning Time: 0.030 ms
 Execution Time: 12.245 ms
(3 rows)

Time: 18.223 ms
postgres=# EXPLAIN ANALYZE CREATE TABLE IF NOT EXISTS bar AS SELECT 1;
ERROR:  relation "bar" already exists
Time: 2.129 ms

Andreas



Add missing CREATE TABLE IF NOT EXISTS table_name AS EXECUTE query;

2019-02-05 Thread Andreas Karlsson

Hi,

The first example below works, while the second one is a syntax error, 
even though the documentation 
(https://www.postgresql.org/docs/11/sql-createtableas.html) makes it 
seem like both should be valid. I do not see any reason not to support 
CTAS with both IF NOT EXISTS and EXECUTE at the same time, so I am 
guessing that this was an oversight.


I have attached a patch which fixes this. What do you think? Should I 
add a new test case for this or is the change simple enough to not 
require any new test?


# CREATE TABLE IF NOT EXISTS a AS SELECT 1;
SELECT 1

# PREPARE q AS SELECT 1;
PREPARE
Time: 0.209 ms
foo=# CREATE TABLE IF NOT EXISTS a AS EXECUTE q;
ERROR:  syntax error at or near "EXECUTE"
LINE 1: CREATE TABLE IF NOT EXISTS a AS EXECUTE q;
^
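
As for the test question, here is a minimal sketch of what a regression test 
could look like with the patch applied (object names and file placement are 
just my guess):

PREPARE ctas_ine_q AS SELECT 1 AS a;
CREATE TABLE IF NOT EXISTS ctas_ine AS EXECUTE ctas_ine_q;
-- the second run should only emit a NOTICE and skip, not fail
CREATE TABLE IF NOT EXISTS ctas_ine AS EXECUTE ctas_ine_q;
SELECT * FROM ctas_ine;
DROP TABLE ctas_ine;
DEALLOCATE ctas_ine_q;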

Andreas


ctas-exec-ifne-v1.sql
Description: application/sql


Re: Bogus lateral-reference-propagation logic in create_lateral_join_info

2019-02-05 Thread Tom Lane
David Rowley  writes:
> On Wed, 6 Feb 2019 at 10:57, Tom Lane  wrote:
>> Also, I'd really like to know why I had to put in the exception seen
>> in the loop for AppendRelInfos that do not point to a valid parent.
>> It seems to me that that is almost certainly working around a bug in
>> the partitioning logic.  (Without it, the partition_prune regression test
>> crashes.)  Or would somebody like to own up to having created that state
>> of affairs intentionally?  If so why?

> Sounds pretty strange to me.   What exactly do you mean by not
> pointing to a valid parent? Do you mean the parent_relid index is not
> a valid RelOptInfo?

Exactly --- the parent_relid index does not have a RelOptInfo in
simple_rel_array[].

> Can you point to the regression test that's doing this?

Yeah, I misspoke above, it's in partition_join not partition_prune,
specifically

DELETE FROM prt1_l
WHERE EXISTS (
  SELECT 1
FROM int4_tbl,
 LATERAL (SELECT int4_tbl.f1 FROM int8_tbl LIMIT 2) ss
WHERE prt1_l.c IS NULL);

I didn't run this totally to bottom yet, but what seems to be
happening is that inheritance_planner is creating a partition-specific
subplan for the DELETE, and it's allowing AppendRelInfos from the
parent query to propagate into the subquery even though they have
nothing to do with that subquery.

We could just decide that it's okay for code dealing with the subquery
to ignore the irrelevant AppendRelInfos (which is basically what my
draft patch did), but I find that to be an uncomfortable answer: it
seems *way* too likely to result in code that can mask real bugs.
I'd be happier fixing things so that inheritance_planner doesn't
propagate anything into the subquery that doesn't make sense in the
subquery's context.  But perhaps that's unreasonably hard?  Not enough
data yet.

regards, tom lane



Re: Documentation and code don't agree about partitioned table UPDATEs

2019-02-05 Thread Amit Kapila
On Tue, Feb 5, 2019 at 2:14 PM David Rowley
 wrote:
>
> The docs in PG11 and master both state:
>
> When an UPDATE causes a row to move from one partition to another,
> there is a chance that another concurrent UPDATE or DELETE misses this
> row. Suppose session 1 is performing an UPDATE on a partition key, and
> meanwhile a concurrent session 2 for which this row is visible
> performs an UPDATE or DELETE operation on this row. Session 2 can
> silently miss the row if the row is deleted from the partition due to
> session 1's activity. In such case, session 2's UPDATE or DELETE,
> being unaware of the row movement thinks that the row has just been
> deleted and concludes that there is nothing to be done for this row.
> In the usual case where the table is not partitioned, or where there
> is no row movement, session 2 would have identified the newly updated
> row and carried out the UPDATE/DELETE on this new row version.
>
>
> Which was true when it was added by Robert in 2f178441044. However,
> f16241bef7c then added code to cause serialization failures when the
> update/delete process encountered a moved row.
>

I agree that the docs need to be updated and this patch should be
backpatched as well.  However, I think the older wording was more
descriptive and clear, so I have modified your patch a bit to retain
part of old wording, see the result as attached.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


partition_update_doc_fix_2.patch
Description: Binary data


Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2019-02-05 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Feb 5, 2019 at 2:08 PM Tom Lane  wrote:
>> I've got much of the code for it already (in the wreckage of my failed
>> attempts), so I'll go back and finish that up.  I was just waiting to see
>> how loudly people would howl about using object type as a condition for
>> figuring out what a pg_depend entry really means.  If we're okay with
>> that hack, I think I can make it work.

> Perhaps I've missed some subtlety, but I'm not sure that it's all that
> ugly. If splitting INTERNAL_AUTO into two new dependency types amounts
> to the same thing as what you suggest here, then what's the
> difference?

It's the same as long as we always think that the "real owner" of a
subsidiary object is of the same type as that object, eg that the
real owner of a per-partition trigger is the parent trigger, the real
owner of a per-partition index is the parent index, etc ... and that
there's only going to be one parent object of that type.

I don't immediately have a counterexample to this, which is why I feel
like this is an OK solution for now.  But I'm not sure it'll hold good
indefinitely.

Actually, the index case is already on the edge of being a problem:
both parents will be relations, and I suspect the code will have to
look at relkinds to identify which parent to consider the "real owner".
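
To make that concrete, here is a hypothetical way to look at the pg_depend 
entries involved; the object names are invented, and the point is just that 
both referenced parents show up as pg_class objects, so only relkind tells 
the parent index apart from the partition table:

CREATE TABLE pt (a int) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE INDEX pt_a_idx ON pt (a);

-- the partition's index is expected to be auto-named pt1_a_idx here
SELECT d.refobjid::regclass AS referenced, c.relkind, d.deptype
FROM pg_depend d
JOIN pg_class c ON c.oid = d.refobjid
WHERE d.classid = 'pg_class'::regclass
  AND d.objid = 'pt1_a_idx'::regclass
  AND d.refclassid = 'pg_class'::regclass;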

BTW, does anyone else feel like our terminology around partitions is
a dead loss?  I have not seen anybody use the word in a way that makes
a clear distinction between the parent "partitioned" object and the
child "partition" objects, at least not when it comes to subsidiary
objects like triggers.

regards, tom lane



Re: Non-deterministic IndexTuple toast compression from index_form_tuple() + amcheck false positives

2019-02-05 Thread Peter Geoghegan
On Fri, Feb 1, 2019 at 6:27 PM Peter Geoghegan  wrote:
> Attached draft patch fixes the bug by doing fairly simple
> normalization. I think that TOAST compression of datums in indexes is
> fairly rare in practice, so I'm not very worried about the fact that
> this won't perform as well as it could with indexes that have a lot of
> compressed datums. I think that the interface I've added might need to
> be expanded for other things in the future (e.g., to make amcheck work
> with nbtree-native duplicate compression), and not worrying about the
> performance too much helps with that goal.
>
> I'll pick this up next week, and likely commit a fix by Wednesday or
> Thursday if there are no objections. I'm not sure if the test case is
> worth including.

On second thought, the test should look like this:

$ psql --no-psqlrc --echo-queries -f bug_repro.sql
drop table if exists bug;
DROP TABLE
create table bug (buggy text);
CREATE TABLE
alter table bug alter column buggy set storage plain;
ALTER TABLE
create index toasty on bug(buggy);
CREATE INDEX
alter table bug alter column buggy set storage extended;
ALTER TABLE
insert into bug select repeat ('a', 2100);
INSERT 0 1
select bt_index_parent_check('toasty', true);
psql:bug_repro.sql:7: ERROR:  heap tuple (0,1) from table "bug" lacks
matching index tuple within index "toasty"

This relies on the fact that the pg_attribute entry for the index is
more or less a straight copy of the corresponding pg_attribute entry
for the table at the time of the CREATE INDEX. I arrange to make
storage of the index attribute plain and storage for the heap
attribute extended. That way, TOAST compression ends up being applied
inconsistently between toast_insert_or_update() and index_form_tuple(),
without really relying on implementation details that are subject to change.
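
To see the mismatch the repro relies on, one can compare the attstorage 
settings of the heap column and the index column after the script has run. 
A sketch, using the objects from the repro above:

select attrelid::regclass as rel, attname, attstorage
from pg_attribute
where attrelid in ('bug'::regclass, 'toasty'::regclass)
  and attname = 'buggy';
-- expected: 'x' (extended) for "bug", but still 'p' (plain) for "toasty",
-- since the index kept the storage setting copied at CREATE INDEX time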

-- 
Peter Geoghegan



Re: PostgreSQL vs SQL/XML Standards

2019-02-05 Thread Chapman Flack
On 02/01/19 20:20, Michael Paquier wrote:
> On Thu, Jan 31, 2019 at 04:26:31PM +0100, Pavel Stehule wrote:
>> I'll mark this patch as ready for commiters.
> 
> For now I have moved the patch to the next CF, with the same status.

I wonder whether, given the move to next CF, it makes sense to change
the title of the CF entry from "XMLTABLE" to, more generically, XML
improvements, and get one or two more small changes in:

- get XMLPARSE(CONTENT... (and cast-to-xml with XMLOPTION=content) to
  succeed even for content with DTDs, so that the content subtype really
  does fully include the document subtype, aligning it with the SQL:2006+
  standard. I think this would be a simple patch that I can deliver early
  this month, and Tom found reports where the current behavior already
  bites people in pg_restore. Its only effect would be to allow a currently-
  failing case to succeed (and stop biting people).

- get XMLEXISTS and XMLTABLE to allow passing of named parameters. I have
  less of a sense of how difficult that might be, but I see that the libxml
  xpath API does allow them. I don't know whether they were left out of the
  original development just arbitrarily, or whether some effort was made
  and ran into some problem.

  The value of supporting the named parameters, especially when the library
  limits us to XPath 1.0, is that the XPath 1.0 limitation that a value
  passed as the context item can only be an XML 'document' only applies to
  the context item, not to named parameters. So if somebody is trying to
  port an expression ...'foo(.)' PASSING bar... and bar is not in document
  form, there's a simple rewrite to ...'foo($a)' PASSING bar AS a...
  whereas if we can't do named parameters, frustration ensues.
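
For concreteness, the rewrite would look roughly like this. The PASSING ... AS
syntax in the second statement is the proposed form, not something that works
today, and the table and column names are invented:

-- today: bar becomes the XPath context item, so it must be a well-formed document
SELECT xmlexists('//item' PASSING bar) FROM frag_tbl;
-- with named parameters: bar is passed as $a and need not be in document form
SELECT xmlexists('$a//item' PASSING bar AS a) FROM frag_tbl;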

Again, the only effect of such a change would be to permit something that
currently isn't possible. I don't think I can promise to work on a patch
for the second issue, but they both seem worthwhile, and I'm happy to
work on the first.

It seems to me these changes and the doc patch to go with them are
closely enough related to fit in one CF entry that's still smaller and
simpler than many, and that shouldn't be too difficult to review for v12,
but I'll defer to more experienced voices.

Regards,
-Chap



Re: add_partial_path() may remove dominated path but still in use

2019-02-05 Thread Kohei KaiGai
Hello,

Let me remind the thread again.
I've been waiting a month for the fix to be committed...

2019年1月22日(火) 20:50 Kohei KaiGai :
>
> Let me remind the thread.
> If no more comments, objections, or better ideas, please commit this fix.
>
> Thanks,
>
> 2019年1月17日(木) 18:29 Kyotaro HORIGUCHI :
> >
> > Hello, sorry for the absence.
> >
> > At Fri, 11 Jan 2019 11:36:43 -0500, Robert Haas  
> > wrote in 
> > 
> > > On Thu, Jan 10, 2019 at 9:10 PM Kohei KaiGai  wrote:
> > > > 2019年1月11日(金) 5:52 Robert Haas :
> > > > > On Wed, Jan 9, 2019 at 12:44 AM Kohei KaiGai  
> > > > > wrote:
> > > > > > So, is it sufficient if set_rel_pathlist_hook is just relocated in
> > > > > > front of the generate_gather_paths?
> > > > > > If we have no use case for the second hook, here is little necessity
> > > > > > to have the post_rel_pathlist_hook() here.
> > > > > > (At least, PG-Strom will use the first hook only.)
> > > > >
> > > > > +1.  That seems like the best way to be consistent with the principle
> > > > > that we need to have all the partial paths before generating any
> > > > > Gather paths.
> > > > >
> > > > Patch was updated, just for relocation of the set_rel_pathlist_hook.
> > > > Please check it.
> > >
> > > Seems reasonable to me.
> >
> > Also seems reasonable to me.  The extension can still call
> > generate_gather_paths redundantly as is, but that does almost no
> > harm, so it is acceptable even in a minor release.
> >
> > regards.
> >
> > --
> > Kyotaro Horiguchi
> > NTT Open Source Software Center
> >
>
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei 

-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


pgsql-bugfix-relocation-path-set_rel_pathlist_hook.v11.patch
Description: Binary data


pgsql-bugfix-relocation-path-set_rel_pathlist_hook.v10.patch
Description: Binary data


Re: pg11.1: dsa_area could not attach to segment

2019-02-05 Thread Thomas Munro
On Wed, Feb 6, 2019 at 1:10 PM Justin Pryzby  wrote:
> This is a contrived query which I made up to try to exercise/stress bitmap
> scans based on Thomas's working hypothesis for this error/bug.  This seems to
> be easier to hit than the other error ("could not attach to segment") - a loop
> around this query has run into "free pages" several times today.

Thanks.  I'll go and try to repro this with queries that look like that.

It's possibly interesting that you're running on VMWare (as mentioned
in an off-list email), though I haven't got a specific theory about
why that'd be relevant.  I suppose it could be some kind of cache
coherency bug that is more likely there for whatever reason.  I've
being trying to repro on a laptop and a couple of bare metal servers.
Can anyone else who has hit this comment on any virtualisation they
might be using?

-- 
Thomas Munro
http://www.enterprisedb.com



RE: Cache relation sizes?

2019-02-05 Thread Ideriha, Takeshi
>From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]

>On the other hand, the simplest method I thought that could also work is to 
>only cache
>the file size (nblock) in shared memory, not in the backend process, since 
>both nblock
>and relsize_change_counter are uint32 data type anyway. If 
>relsize_change_counter
>can be changed without lock, then nblock can be changed without lock, is it 
>right? In
>that case, nblock can be accessed directly in shared memory. In this case, is 
>the
>relation size necessary to be cached in backend?

(Aside from which idea is better...)
If you want to put the relation size in shared memory, then I don't think a 
cache in the backend is necessary, because every time the relation size is 
updated you would need to invalidate the cache in the backends. Taking a 
shared lock on the cache when reading and an exclusive lock when updating is 
simple enough without a backend cache. 

>(2) Is the MdSharedData temporary or permanent in shared memory?
That data structure seems to be initialized at InitPostgres time, which means 
it's permanent, because shared memory initialized by postgres doesn't get 
dropped as far as I know.
(If you want a temporary data structure, then another mechanism like 
dsm/dsa/dshash is a candidate.)

Regards,
Takeshi Ideriha


Re: Protect syscache from bloating with negative cache entries

2019-02-05 Thread Kyotaro HORIGUCHI
At Tue, 5 Feb 2019 02:40:35 +, "Tsunakawa, Takayuki" 
 wrote in 
<0A3221C70F24FB45833433255569204D1FB93A16@G01JPEXMBYT05>
> From: br...@momjian.us [mailto:br...@momjian.us]
> > On Mon, Feb 4, 2019 at 08:23:39AM +, Tsunakawa, Takayuki wrote:
> > > Horiguchi-san, Bruce, all, So, why don't we make
> > > syscache_memory_target the upper limit on the total size of all
> > > catcaches, and rethink the past LRU management?
> > 
> > I was going to say that our experience with LRU has been that the
> > overhead is not worth the value, but that was in shared resource cases,
> > which this is not.
> 
> That's good news!  Then, let's proceed with the approach involving LRU, 
> Horiguchi-san, Ideriha-san.

If by "LRU" you mean an access-time-ordered list of entries, I
still object to involving it, since it is too complex in the search
code paths. Invalidation would make things more complex. The
current patch sorts entries by ct->lastaccess and discards
entries not accessed for more than the threshold, only when doubling
the cache capacity. It is already a kind of LRU in behavior.

This patch intends to keep caches from bloating with unnecessary
entries, which were negative ones at first, and now also
less-accessed ones. If by "LRU" you mean something that puts a hard
limit on the number or size of a catcache or all caches, that would be
doable by adding a sort phase before pruning, like
CatCacheCleanOldEntriesByNum() in the attached PoC (first
attachment), as food for discussion.

With the second attached script, we can observe what is happening
from another session by the following query.

select relname, size, ntuples, ageclass from pg_stat_syscache where relname = 'pg_statistic'::regclass;

> pg_statistic | 1041024 |7109 | 
> {{1,1109},{3,0},{30,0},{60,0},{90,6000},{0,0

On the other hand, unlike the original pruning, this happens
independently of hash resizing, so it will cause another observable
intermittent slowdown, separate from the rehashing one.

The two should have about the same impact on performance when
disabled. I'll take some quick numbers using pgbench.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 21f7b5528be03274dae9e58690c35cee9e68c82f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. With this patch, This adds
consideration of removal of entries that haven't been used for a
certain time before enlarging the hash array.
---
 doc/src/sgml/config.sgml  |  38 ++
 src/backend/access/transam/xact.c |   5 +
 src/backend/utils/cache/catcache.c| 166 --
 src/backend/utils/misc/guc.c  |  23 
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h  |  28 -
 6 files changed, 254 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9b7a7388d5..d0d2374944 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
   
  
 
+ 
+  syscache_memory_target (integer)
+  
+   syscache_memory_target configuration parameter
+  
+  
+  
+   
+Specifies the maximum amount of memory to which syscache is expanded
+without pruning. The value defaults to 0, indicating that pruning is
+always considered. After exceeding this size, syscache pruning is
+considered according to
+. If you need to keep
+certain amount of syscache entries with intermittent usage, try
+increase this setting.
+   
+  
+ 
+
+ 
+  syscache_prune_min_age (integer)
+  
+   syscache_prune_min_age configuration parameter
+  
+  
+  
+   
+Specifies the minimum amount of unused time in seconds at which a
+syscache entry is considered to be removed. -1 indicates that syscache
+pruning is disabled at all. The value defaults to 600 seconds
+(10 minutes). The syscache entries that are not
+used for the duration can be removed to prevent syscache bloat. This
+behavior is suppressed until the size of syscache exceeds
+.
+   
+  
+ 
+
  
   max_stack_depth (integer)
   
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda87804..ddc433c59e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -734,7 +734,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
 	if (!IsParallelWorker())
+	{
 		stmtStartTimestamp = GetCurrentTimestamp();
+
+		/* Set this timestamp as aproximated current time */
+		SetCatCacheClock(stmtStartTimestamp);
+	}
 	else
 		Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcac

Re: Protect syscache from bloating with negative cache entries

2019-02-05 Thread Kyotaro HORIGUCHI
At Wed, 06 Feb 2019 14:43:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190206.144334.193118280.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 5 Feb 2019 02:40:35 +, "Tsunakawa, Takayuki" 
>  wrote in 
> <0A3221C70F24FB45833433255569204D1FB93A16@G01JPEXMBYT05>
> > From: br...@momjian.us [mailto:br...@momjian.us]
> > > On Mon, Feb 4, 2019 at 08:23:39AM +, Tsunakawa, Takayuki wrote:
> > > > Horiguchi-san, Bruce, all, So, why don't we make
> > > > syscache_memory_target the upper limit on the total size of all
> > > > catcaches, and rethink the past LRU management?
> > > 
> > > I was going to say that our experience with LRU has been that the
> > > overhead is not worth the value, but that was in shared resource cases,
> > > which this is not.
> > 
> > That's good news!  Then, let's proceed with the approach involving LRU, 
> > Horiguchi-san, Ideriha-san.
> 
> If by "LRU" you mean an access-time-ordered list of entries, I
> still object to involving it, since it is too complex in the search
> code paths. Invalidation would make things more complex. The
> current patch sorts entries by ct->lastaccess and discards
> entries not accessed for more than the threshold, only when doubling
> the cache capacity. It is already a kind of LRU in behavior.
> 
> This patch intends to keep caches from bloating with unnecessary
> entries, which were negative ones at first, and now also
> less-accessed ones. If by "LRU" you mean something that puts a hard
> limit on the number or size of a catcache or all caches, that would be
> doable by adding a sort phase before pruning, like
> CatCacheCleanOldEntriesByNum() in the attached PoC (first
> attachment), as food for discussion.
> 
> With the second attached script, we can observe what is happening
> from another session by the following query.
> 
> select relname, size, ntuples, ageclass from pg_stat_syscache where relname = 'pg_statistic'::regclass;
> 
> > pg_statistic | 1041024 |7109 | 
> > {{1,1109},{3,0},{30,0},{60,0},{90,6000},{0,0
> 
> On the other hand, unlike the original pruning, this happens
> independently of hash resizing, so it will cause another observable
> intermittent slowdown, separate from the rehashing one.
> 
> The two should have about the same impact on performance when
> disabled. I'll take some quick numbers using pgbench.

Sorry, I forgot to consider references in the previous patch; an updated
patch is attached, along with the test script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 952497a1fad57ac49e0b772a147201aa31065183 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. With this patch, This adds
consideration of removal of entries that haven't been used for a
certain time before enlarging the hash array.
---
 doc/src/sgml/config.sgml  |  38 ++
 src/backend/access/transam/xact.c |   5 +
 src/backend/utils/cache/catcache.c| 164 --
 src/backend/utils/misc/guc.c  |  23 
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h  |  28 -
 6 files changed, 252 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9b7a7388d5..d0d2374944 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
   
  
 
+ 
+  syscache_memory_target (integer)
+  
+   syscache_memory_target configuration parameter
+  
+  
+  
+   
+Specifies the maximum amount of memory to which syscache is expanded
+without pruning. The value defaults to 0, indicating that pruning is
+always considered. After exceeding this size, syscache pruning is
+considered according to
+. If you need to keep
+certain amount of syscache entries with intermittent usage, try
+increase this setting.
+   
+  
+ 
+
+ 
+  syscache_prune_min_age (integer)
+  
+   syscache_prune_min_age configuration parameter
+  
+  
+  
+   
+Specifies the minimum amount of unused time in seconds at which a
+syscache entry is considered to be removed. -1 indicates that syscache
+pruning is disabled at all. The value defaults to 600 seconds
+(10 minutes). The syscache entries that are not
+used for the duration can be removed to prevent syscache bloat. This
+behavior is suppressed until the size of syscache exceeds
+.
+   
+  
+ 
+
  
   max_stack_depth (integer)
   
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda87804..ddc433c59e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -734,7

RE: Cache relation sizes?

2019-02-05 Thread Tsunakawa, Takayuki
From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]
> On the other hand, the simplest method I thought that could also work is
> to only cache the file size (nblock) in shared memory, not in the backend
> process, since both nblock and relsize_change_counter are uint32 data type
> anyway. If relsize_change_counter can be changed without lock, then nblock
> can be changed without lock, is it right? In that case, nblock can be accessed
> directly in shared memory. In this case, is the relation size necessary
> to be cached in backend?

Although I haven't looked deeply at Thomas's patch yet, there's currently no 
place to store the size per relation in shared memory.  You have to wait for 
the global metacache that Ideriha-san is addressing.  Then, you can store the 
relation size in the RelationData structure in relcache.



> (2) Is the MdSharedData temporary or permanent in shared memory?
> from the patch:
>  typedef struct MdSharedData
>  {
>   /* XXX could have an array of these, and use rel OID % nelements? */
>   pg_atomic_uint32    relsize_change_counter;
>  } MdSharedData;
> 
>  static MdSharedData *MdShared;

Permanent in shared memory.


Regards
Takayuki Tsunakawa



A separate table level option to control compression

2019-02-05 Thread Pavan Deolasee
Hello,

Currently either the table level option `toast_tuple_target` or the compile
time default `TOAST_TUPLE_TARGET` is used to decide whether a new tuple
should be compressed or not. While this works reasonably well for most
situations, at times the user may not want to pay the overhead of toasting,
yet still take advantage of inline compression.

I would like to propose a new table level option, compress_tuple_target,
which can be set independently of toast_tuple_target, and is checked while
deciding whether to compress the new tuple or not.

For example,

CREATE TABLE compresstest250 (a int, b text) WITH (compress_tuple_target =
250);
CREATE TABLE compresstest2040 (a int, b text) WITH (compress_tuple_target =
2040);

-- shouldn't get compressed nor toasted
INSERT INTO compresstest250 VALUES (1, repeat('1234567890',20));

-- should get compressed, but not toasted
INSERT INTO compresstest250 VALUES (2, repeat('1234567890',30));

-- shouldn't get compressed nor toasted
INSERT INTO compresstest2040 VALUES (1, repeat('1234567890',20));
INSERT INTO compresstest2040 VALUES (2, repeat('1234567890',30));

Without this patch, the second INSERT will not compress the tuple since its
length is less than the toast threshold. With the patch and after setting
the table-level option, one can compress such tuples.
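
One way to verify the effect is to compare the raw length of the value with
its stored size; pg_column_size() reports the on-disk datum size, so a
compressed inline value shows up noticeably smaller than octet_length():

SELECT a, octet_length(b) AS raw_bytes, pg_column_size(b) AS stored_bytes
FROM compresstest250;
-- with compress_tuple_target = 250, the second row's stored_bytes should be
-- well below its raw_bytes (300), while the first row stays uncompressed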

The attached patch implements this idea.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 22dbc07b23..149d6dd28f 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1310,6 +1310,27 @@ WITH ( MODULUS numeric_literal, REM
 

 
+   
+compress_tuple_target (integer)
+
+ 
+ The compress_tuple_target specifies the minimum tuple length required before
+ we try to compress columns marked as Extended or Main and applies only to
+ new tuples - there is no effect on existing rows.
+ By default this parameter is set to allow at least 4 tuples per block,
+ which with the default blocksize will be 2040 bytes. Valid values are
+ between 128 bytes and the (blocksize - header), by default 8160 bytes.
+ If the value is set to a value greater than
+ toast_tuple_target, then that will be ignored and the value
+ of toast_tuple_target will be used instead.
+ Note that the default setting is often close to optimal, and
+ it is possible that setting this parameter could have negative
+ effects in some cases.
+ This parameter cannot be set for TOAST tables.
+ 
+
+   
+

 parallel_workers (integer)
 
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index cdf1f4af62..db2f875721 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -300,6 +300,15 @@ static relopt_int intRelOpts[] =
 		},
 		TOAST_TUPLE_TARGET, 128, TOAST_TUPLE_TARGET_MAIN
 	},
+	{
+		{
+			"compress_tuple_target",
+			"Sets the target tuple length at which columns will be compressed",
+			RELOPT_KIND_HEAP,
+			ShareUpdateExclusiveLock
+		},
+		TOAST_TUPLE_TARGET, 128, TOAST_TUPLE_TARGET_MAIN
+	},
 	{
 		{
 			"pages_per_range",
@@ -1379,6 +1388,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, log_min_duration)},
 		{"toast_tuple_target", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, toast_tuple_target)},
+		{"compress_tuple_target", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, compress_tuple_target)},
 		{"autovacuum_vacuum_scale_factor", RELOPT_TYPE_REAL,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_scale_factor)},
 		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dc3499349b..ec48b25b6c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2351,6 +2351,9 @@ static HeapTuple
 heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
 	CommandId cid, int options)
 {
+	int		toast_tuple_threshold;
+	int		compress_tuple_threshold;
+
 	/*
 	 * Parallel operations are required to be strictly read-only in a parallel
 	 * worker.  Parallel inserts are not safe even in the leader in the
@@ -2375,6 +2378,12 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
 	HeapTupleHeaderSetXmax(tup->t_data, 0); /* for cleanliness */
 	tup->t_tableOid = RelationGetRelid(relation);
 
+	toast_tuple_threshold = RelationGetToastTupleTarget(relation,
+TOAST_TUPLE_THRESHOLD);
+	compress_tuple_threshold = RelationGetCompressTupleTarget(relation,
+toast_tuple_threshold);
+	compress_tuple_threshold = Min(compress_tuple_threshold,
+   toast_tuple_threshold);
 	/*
 	 * If the new tuple is too big for storage or contain