Re: Experimenting with hash join prefetch

2018-10-14 Thread Andrey Borodin


Hi, Thomas!
> On 14 Oct 2018, at 9:18, Thomas Munro wrote:
> 
> +   /* Prefetch the bucket for the next key */
> +   uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
> +   uint32 next_bucket = next_hash % hashtable->nbuckets;
> +   __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);


+1, I also think that we should use __builtin_prefetch these days (years,
actually).
Right after listening to Anastassia Ailamaki's (author of the referenced
paper) talk at VLDB, I proposed doing the same for B-tree [0], but did not
really pursue that idea.

[0] 
https://www.postgresql.org/message-id/3B774C9E-01E8-46A7-9642-7830DC1108F1%40yandex-team.ru
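
For concreteness, here is the pattern from the quoted snippet as a
self-contained toy, using GCC/Clang's __builtin_prefetch; the hash function
and the node/bucket layout are illustrative stand-ins, not the executor's
actual structures:

#include <stdint.h>
#include <stddef.h>

typedef struct Node { uint32_t key; struct Node *next; } Node;

/* Toy hash mixer (stand-in for hash_uint32). */
static inline uint32_t toy_hash(uint32_t x) { x *= 0x9e3779b1u; return x ^ (x >> 16); }

/*
 * Insert keys[0..n-1] into a chained hash table, prefetching the bucket
 * that the *next* key will touch so that its cache miss overlaps with
 * the current insertion -- the same one-step pipeline as above.
 */
static void
build(Node **buckets, size_t nbuckets, Node *nodes, const uint32_t *keys, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (i + 1 < n)
			__builtin_prefetch(&buckets[toy_hash(keys[i + 1]) % nbuckets]);

		size_t b = toy_hash(keys[i]) % nbuckets;
		nodes[i].key = keys[i];
		nodes[i].next = buckets[b];
		buckets[b] = &nodes[i];
	}
}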


Re: B-tree cache prefetches

2018-10-14 Thread Andrey Borodin
Hello!

> On 31 Aug 2018, at 2:40, Thomas Munro wrote:
> A related topic is the cache-unfriendliness of traditional binary
> searches of sorted data.  Check out "Array Layouts for
> Comparison-Based Searching"[1] if you haven't already.  It says that
> if your array fits in L2 cache, as our btree pages do, then binary
> search still wins against Eytzinger et al, which I was a bit
> disappointed 
I've re-read that paper. Their results are not about our case, here's a quote:
> For arrays small enough _to be kept_ in L2 cache, the branch-free binary 
> search code listed in Listing 2 is the fastest algorithm

Surely, we cannot keep all pages in L2. The Eytzinger layout could be a
realistic possibility, except for one simple problem: I do not see how we
can insert items into this layout.

Also, that research is quite distant from B-tree binsearch: their data
items are small and do not represent references to actual data. Eytzinger
exploits the fact that two possible future keys share the same cache line.
But that is hard to guarantee, given that we place items at fairly random
places on the page.
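
For reference, the branch-free binary search that the paper's Listing 2
describes looks roughly like this; a sketch following that description,
not the paper's exact code:

#include <stddef.h>
#include <stdint.h>

/*
 * Branch-free lower bound over a sorted array: the comparison selects
 * the next base with a conditional move instead of a taken/not-taken
 * branch, so there are no branch mispredictions to pay for.
 */
static size_t
lower_bound_branchfree(const int32_t *a, size_t n, int32_t key)
{
	const int32_t *base = a;

	if (n == 0)
		return 0;
	while (n > 1)
	{
		size_t half = n / 2;

		base = (base[half] < key) ? base + half : base;	/* compiles to cmov */
		n -= half;
	}
	return (size_t) (base - a) + (base[0] < key);
}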

Best regards, Andrey Borodin.

Re: Experimenting with hash join prefetch

2018-10-14 Thread Dmitry Dolgov
> On Sun, 14 Oct 2018 at 06:19, Thomas Munro  
> wrote:
>
> Cache-oblivious hash joins cause a lot of TLB and cache misses.
> ...
> (There is another class of cache-aware hash join algorithms that partition
> carefully up front to avoid them; that's not us.)

Just out of curiosity, can you please elaborate more on this part (with
references)? I've been thinking about this topic for a while, and I'm
wondering: if by another class you mean something like this [1], then even
if it's not us today, are there any issues that prevent experimenting in
this area?

[1]: https://www.cse.ust.hk/catalac/papers/coqp_tods08.pdf



Re: Experimenting with hash join prefetch

2018-10-14 Thread Thomas Munro
On Mon, Oct 15, 2018 at 12:16 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Sun, 14 Oct 2018 at 06:19, Thomas Munro  
> > wrote:
> > Cache-oblivious hash joins cause a lot of TLB and cache misses.
> > ...
> > (There is another class of cache-aware hash join algorithms that partition
> > carefully up front to avoid them; that's not us.)
>
> Just out of curiosity, can you please elaborate more on this part (with
> references)? I've been thinking about this topic for a while, and I'm
> wondering: if by another class you mean something like this [1], then even
> if it's not us today, are there any issues that prevent experimenting in
> this area?

Hmm, I didn't mean the term-of-art "cache-oblivious" (Wikipedia
definition: "an algorithm designed to take advantage of a CPU cache
without having the size of the cache as an explicit parameter"); I meant
"not cache-conscious" (we don't do anything at all to reduce cache
misses, though obviously we could add prefetching to improve on that).

The distinction I'm making is between "no partition" hash join (what
we have), where you don't have to do any work up front, but you pay
for a lot of cache misses during building and probing, and "partition"
hash join, notably "radix join" (what MonetDB has?), where you have a
partition phase before the build phase that aims to break the data up
into enough partitions so that the hash tables will fit better in
cache, making the later phases faster.  There seems to be some
disagreement about which is best -- passing through the data first is
expensive, but so are cache misses on every probe, and there are
claims that the winner depends on skew and tuple size.
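
To make "partition carefully up front" concrete, here is a minimal sketch
of one radix-partitioning pass; RADIX_BITS, the single level, and the
hashes-only payload are illustrative simplifications, not any system's
actual implementation:

#include <stdint.h>
#include <stddef.h>

#define RADIX_BITS 6					/* 2^6 = 64 partitions */
#define NPART (1u << RADIX_BITS)

/*
 * Histogram the low hash bits, turn counts into partition offsets, then
 * scatter.  In a real radix join each partition is then small enough
 * for its hash table to fit in cache during build and probe.
 */
static void
radix_partition(const uint32_t *hashes, size_t n,
				uint32_t *out, size_t offsets[NPART + 1])
{
	size_t		counts[NPART] = {0};
	size_t		cursor[NPART];

	for (size_t i = 0; i < n; i++)		/* pass 1: histogram */
		counts[hashes[i] & (NPART - 1)]++;

	offsets[0] = 0;
	for (size_t p = 0; p < NPART; p++)	/* prefix sum */
		offsets[p + 1] = offsets[p] + counts[p];

	for (size_t p = 0; p < NPART; p++)
		cursor[p] = offsets[p];
	for (size_t i = 0; i < n; i++)		/* pass 2: scatter */
		out[cursor[hashes[i] & (NPART - 1)]++] = hashes[i];
}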

Here's some of the stuff I read/watched about this subject:
https://15721.courses.cs.cmu.edu/spring2018/schedule.html#apr-04-2018
Add to that 
http://www.adms-conf.org/2017/camera-ready/Analyzing_In_Memory_Hash_Joins__Granularity_Matters.pdf.

Skimming very quickly through the paper you posted, yeah, I mean
exactly that stuff.  Specifically I was thinking of the radix join
mentioned in background section 2.3.  (I see also that the same
authors wrote a paper "Cache-Oblivious Hash Joins" which appears to
describe a refinement of radix that doesn't need to be parameterised
for cache size.)  Sure, we could always consider these ideas.  I am
not personally working on that; to me it looked very hard to build,
for a feature so uncertain to produce better results!

(Note: we do of course have some kind of partitioning called
"batching" when work_mem runs out, but it's not a kind of partitioning
that cares about reducing cache misses, so if I understood correctly
it's still "no partition" as far as this discussion goes.)

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pgsql: Add TAP tests for pg_verify_checksums

2018-10-14 Thread Andrew Dunstan




On 10/13/2018 09:59 PM, Michael Paquier wrote:

On Sat, Oct 13, 2018 at 05:53:00PM -0400, Andrew Dunstan wrote:

It occurred to me that a pretty simple fix could just be to blacklist
everything that didn't start with a digit. The whitelist approach is
probably preferable... depends how urgent we see this as.

Yeah, possibly, still that's not a correct long-term fix.  So attached
is what I think we should do for HEAD and REL_11_STABLE with an approach
using a whitelist.  I have added positive and negative tests on top of
the existing TAP tests, as suggested by Michael B, and I made the code
use relpath.h to make sure that we don't miss any fork types.

Any objections to this patch?




This code now seems redundant:

 if (strcmp(fn, ".") == 0 ||
     strcmp(fn, "..") == 0)
     return true;


I would probably reverse the order of these two tests. It might not make 
any difference, assuming fn is never an empty string, but it seems more 
logical to me.


   +    /* good to go if only digits */
   +    if (fn[pos] == '\0')
   +        return false;
   +    /* leave if no digits */
   +    if (pos == 0)
   +        return true;

It also looks to me like the check for a segment number doesn't ensure there is at least 
one digit, so "123." might pass, but I could be wrong. In any case, there isn't 
a test for that, and there probably should be.

cheers

andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




rowcount estimate varies WRT partitionwise_join

2018-10-14 Thread Justin Pryzby
I was cross-eyed yesterday due to merge conflicts, but this still seems odd.

I thought that final row counts would not vary with the details of the chosen
plan.  Which seems to hold true when I disable parallel join or hash join, but
not for PWJ.

I noticed this behavior while joining our own tables using eq join on the
partition key plus an inequality comparison also on the partition key (range),
but I see the same thing using tables from the regression test:

pryzbyj=# EXPLAIN SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE 
t1.a = t2.b AND t1.b = 0 ;
  QUERY PLAN  
----------------------------------------------------------------
 Hash Join  (cost=6.96..13.83 rows=12 width=18)
   Hash Cond: (t2.b = t1.a)
   ->  Append  (cost=0.00..6.00 rows=200 width=9)
 ->  Seq Scan on prt2_p1 t2  (cost=0.00..1.84 rows=84 width=9)
 ->  Seq Scan on prt2_p2 t2_1  (cost=0.00..1.83 rows=83 width=9)
 ->  Seq Scan on prt2_p3 t2_2  (cost=0.00..1.33 rows=33 width=9)
   ->  Hash  (cost=6.81..6.81 rows=12 width=9)
 ->  Append  (cost=0.00..6.81 rows=12 width=9)
   ->  Seq Scan on prt1_p1 t1  (cost=0.00..2.56 rows=5 width=9)
 Filter: (b = 0)
   ->  Seq Scan on prt1_p2 t1_1  (cost=0.00..2.56 rows=5 width=9)
 Filter: (b = 0)
   ->  Seq Scan on prt1_p3 t1_2  (cost=0.00..1.62 rows=2 width=9)
 Filter: (b = 0)

pryzbyj=# SET enable_partitionwise_join=on;
pryzbyj=# EXPLAIN SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE 
t1.a = t2.b AND t1.b = 0 ;
  QUERY PLAN  
----------------------------------------------------------------
 Append  (cost=2.62..12.75 rows=7 width=18)
   ->  Hash Join  (cost=2.62..4.81 rows=3 width=18)
 Hash Cond: (t2.b = t1.a)
 ->  Seq Scan on prt2_p1 t2  (cost=0.00..1.84 rows=84 width=9)
 ->  Hash  (cost=2.56..2.56 rows=5 width=9)
   ->  Seq Scan on prt1_p1 t1  (cost=0.00..2.56 rows=5 width=9)
 Filter: (b = 0)
   ->  Hash Join  (cost=2.62..4.80 rows=3 width=18)
 Hash Cond: (t2_1.b = t1_1.a)
 ->  Seq Scan on prt2_p2 t2_1  (cost=0.00..1.83 rows=83 width=9)
 ->  Hash  (cost=2.56..2.56 rows=5 width=9)
   ->  Seq Scan on prt1_p2 t1_1  (cost=0.00..2.56 rows=5 width=9)
 Filter: (b = 0)
   ->  Hash Join  (cost=1.65..3.11 rows=1 width=18)
 Hash Cond: (t2_2.b = t1_2.a)
 ->  Seq Scan on prt2_p3 t2_2  (cost=0.00..1.33 rows=33 width=9)
 ->  Hash  (cost=1.62..1.62 rows=2 width=9)
   ->  Seq Scan on prt1_p3 t1_2  (cost=0.00..1.62 rows=2 width=9)
 Filter: (b = 0)



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-14 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-12 19:47:40 -0400, Tom Lane wrote:
>> So I went looking for a different example to plug in there, and soon
>> found that there weren't any.  If you change all the physically_coercible
>> calls in that script to binary_coercible, its output doesn't change.
>> I'm thinking that we ought to do that, and just get rid of
>> physically_coercible(), so that we have a tighter, more semantically
>> meaningful set of checks here.  We can always undo that if we ever
>> have occasion to type-cheat like that again, but offhand I'm not sure
>> why we would do so.

> Hm, I wonder if it's not a good idea to leave the test there, or rewrite
> it slightly, so we have a more precise warning about cheats like that?

After thinking about this for awhile, I decided that
physically_coercible() is poorly named, because it suggests that it
might for instance allow any 4-byte type to be cast to any other one.
Actually the additional cases it allows are just ones where an explicit
binary-coercion cast would be needed.  So I still think we should
tighten up the tests while we can, but I left that function in place
with a different name and a better comment.

regards, tom lane



Re: rowcount estimate varies WRT partitionwise_join

2018-10-14 Thread Tom Lane
Justin Pryzby  writes:
> I was cross-eyed yesterday due to merge conflicts, but this still seems odd.
> I thought that final row counts would not vary with the details of the chosen
> plan.  Which seems to hold true when I disable parallel join or hash join, but
> not for PWJ.

There are good reasons why that's not enabled by default yet ;-)

regards, tom lane



Re: WIP: Avoid creation of the free space map for small tables

2018-10-14 Thread John Naylor
> On 10/13/18, Amit Kapila  wrote:
>> I think you have found a good way to avoid creating FSM, but can't we
>> use some simpler technique like if the FSM fork for a relation doesn't
>> exist, then check the heapblk number for which we try to update the
>> FSM and if it is less than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
>> creating the FSM.

>> I think it would be better if we can find a common way to avoid
>> creating FSM both during DO and REDO time.  It might be possible if
>> something like what I have said above is feasible.

I've attached v4, which implements the REDO case as closely as
possible to the DO case. I've created a new function to guard against
creation of the FSM, which is called by RecordPageWithFreeSpace() and
RecordAndGetPageWithFreeSpace(). Since XLogRecordPageWithFreeSpace()
takes a relfilenode and not a relation, I had to reimplement that
separately, but the logic is basically the same. It works under
streaming replication.

I've also attached a couple of SQL scripts which, when the aforementioned
DEBUG1 calls are enabled, show what the heap insert code is doing in
different scenarios. make check-world passes.

-John Naylor
From f3abe9fc003430530db71ddd40c4e9bed8a38513 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Sun, 14 Oct 2018 22:35:12 +0700
Subject: [PATCH v4] Avoid creation of the free space map for small tables.

The FSM isn't created if the heap has fewer than 10 blocks. If the last
known good block has insufficient space, try every block before extending
the heap.

If a heap with an FSM is truncated back to below the threshold, the FSM
stays around and can be used as usual.
---
 contrib/pageinspect/expected/page.out |  77 ++--
 contrib/pageinspect/sql/page.sql  |  33 --
 src/backend/access/heap/hio.c | 137 +-
 src/backend/storage/freespace/freespace.c |  61 +-
 src/include/storage/freespace.h   |   4 +
 5 files changed, 229 insertions(+), 83 deletions(-)

diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 3fcd9fbe6d..83e5910453 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -1,48 +1,69 @@
 CREATE EXTENSION pageinspect;
-CREATE TABLE test1 (a int, b int);
-INSERT INTO test1 VALUES (16777217, 131584);
-VACUUM test1;  -- set up FSM
+CREATE TABLE test_rel_forks (a int);
+-- Make sure there are enough blocks in the heap for the FSM to be created.
+INSERT INTO test_rel_forks SELECT g from generate_series(1,10000) g;
+-- set up FSM and VM
+VACUUM test_rel_forks;
 -- The page contents can vary, so just test that it can be read
 -- successfully, but don't keep the output.
-SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0;
  main_0 
 --------
    8192
 (1 row)
 
-SELECT octet_length(get_raw_page('test1', 'main', 1)) AS main_1;
-ERROR:  block number 1 is out of range for relation "test1"
-SELECT octet_length(get_raw_page('test1', 'fsm', 0)) AS fsm_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
+ERROR:  block number 100 is out of range for relation "test_rel_forks"
+SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
  fsm_0 
 -------
   8192
 (1 row)
 
-SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1;
- fsm_1 
--------
-  8192
-(1 row)
-
-SELECT octet_length(get_raw_page('test1', 'vm', 0)) AS vm_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
+ERROR:  block number 10 is out of range for relation "test_rel_forks"
+SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0;
  vm_0 
 ------
  8192
 (1 row)
 
-SELECT octet_length(get_raw_page('test1', 'vm', 1)) AS vm_1;
-ERROR:  block number 1 is out of range for relation "test1"
+SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 1)) AS vm_1;
+ERROR:  block number 1 is out of range for relation "test_rel_forks"
 SELECT octet_length(get_raw_page('xxx', 'main', 0));
 ERROR:  relation "xxx" does not exist
-SELECT octet_length(get_raw_page('test1', 'xxx', 0));
+SELECT octet_length(get_raw_page('test_rel_forks', 'xxx', 0));
 ERROR:  invalid fork name
 HINT:  Valid fork names are "main", "fsm", "vm", and "init".
-SELECT get_raw_page('test1', 0) = get_raw_page('test1', 'main', 0);
+SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));
+ fsm_page_contents 
+-------------------
+ 0: 192   +
+ 1: 192   +
+ 3: 192   +
+ 7: 192   +
+ 15: 192  +
+ 31: 192  +
+ 63: 192  +
+ 127: 192 +
+ 255: 192 +
+ 511: 192 +
+ 1023: 192+
+ 2047: 192+
+ 4095: 192+
+ fp_next_slot: 0  +
+ 
+(1 row)
+
+SELECT get_raw_page('test_rel_forks', 0) = get_raw_page('test_rel_forks', 'main', 0);
  ?column? 
 ----------
  t
 (1 row)
 
+DROP TABLE test_rel_forks;
+CREATE TABLE test

Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT

2018-10-14 Thread Matheus de Oliveira
On Tue, Oct 2, 2018 at 3:40 AM Michael Paquier  wrote:

>
> The last patch set does not apply, so this is moved to next CF, waiting
> on author as new status.
>

Updated the last patch so it can apply cleanly on HEAD.

About the bugfixes, do you think it is better to move to another thread?

Best regards,
-- 
Matheus de Oliveira
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index f13a6cd944..5910680cf3 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -55,7 +55,9 @@ ALTER TABLE [ IF EXISTS ] name
 ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
 ADD table_constraint [ NOT VALID ]
 ADD table_constraint_using_index
-ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+ALTER CONSTRAINT constraint_name
+  [ ON DELETE action ] [ ON UPDATE action ]
+  [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
 VALIDATE CONSTRAINT constraint_name
 DROP CONSTRAINT [ IF EXISTS ]  constraint_name [ RESTRICT | CASCADE ]
 DISABLE TRIGGER [ trigger_name | ALL | USER ]
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e112b4ef4..86dabc9bbc 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7843,8 +7843,43 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
  errmsg("constraint \"%s\" of relation \"%s\" is not a foreign key constraint",
 		cmdcon->conname, RelationGetRelationName(rel))));
 
+	/*
+	 * Check for FKCONSTR_ACTION_UNKNOWN and, if found, replace it with the
+	 * current action. We could handle FKCONSTR_ACTION_UNKNOWN below, but
+	 * since we already have to handle the case of changing to the same
+	 * action, it seems simpler to simply replace FKCONSTR_ACTION_UNKNOWN
+	 * with the current action here.
+	 */
+	if (cmdcon->fk_del_action == FKCONSTR_ACTION_UNKNOWN)
+		cmdcon->fk_del_action = currcon->confdeltype;
+
+	if (cmdcon->fk_upd_action == FKCONSTR_ACTION_UNKNOWN)
+		cmdcon->fk_upd_action = currcon->confupdtype;
+
+	/*
+	 * Do the same for the deferrability attributes. But note that if only
+	 * the initdeferred attribute was changed, and to true, we force
+	 * deferrable to be true as well. On the other hand, if only the
+	 * deferrable attribute was changed, and to false, we force
+	 * initdeferred to be false as well.
+	 */
+	if (!cmdcon->was_deferrable_set)
+		cmdcon->deferrable = cmdcon->initdeferred ? true : currcon->condeferrable;
+
+	if (!cmdcon->was_initdeferred_set)
+		cmdcon->initdeferred = !cmdcon->deferrable ? false : currcon->condeferred;
+
+	/*
+	 * This is just a safety check; it should never happen here.
+	 */
+	if (cmdcon->initdeferred && !cmdcon->deferrable)
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("constraint declared INITIALLY DEFERRED must be DEFERRABLE")));
+
 	if (currcon->condeferrable != cmdcon->deferrable ||
-		currcon->condeferred != cmdcon->initdeferred)
+		currcon->condeferred != cmdcon->initdeferred ||
+		currcon->confdeltype != cmdcon->fk_del_action ||
+		currcon->confupdtype != cmdcon->fk_upd_action)
 	{
 		HeapTuple	copyTuple;
 		HeapTuple	tgtuple;
@@ -7862,6 +7897,8 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
 		copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
 		copy_con->condeferrable = cmdcon->deferrable;
 		copy_con->condeferred = cmdcon->initdeferred;
+		copy_con->confdeltype = cmdcon->fk_del_action;
+		copy_con->confupdtype = cmdcon->fk_upd_action;
		CatalogTupleUpdate(conrel, &copyTuple->t_self, copyTuple);
 
 		InvokeObjectPostAlterHook(ConstraintRelationId,
@@ -7898,23 +7935,106 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
 otherrelids = list_append_unique_oid(otherrelids,
 	 tgform->tgrelid);
 
-			/*
-			 * Update deferrability of RI_FKey_noaction_del,
-			 * RI_FKey_noaction_upd, RI_FKey_check_ins and RI_FKey_check_upd
-			 * triggers, but not others; see createForeignKeyTriggers and
-			 * CreateFKCheckTrigger.
-			 */
-			if (tgform->tgfoid != F_RI_FKEY_NOACTION_DEL &&
-tgform->tgfoid != F_RI_FKEY_NOACTION_UPD &&
-tgform->tgfoid != F_RI_FKEY_CHECK_INS &&
-tgform->tgfoid != F_RI_FKEY_CHECK_UPD)
-continue;
-
 			copyTuple = heap_copytuple(tgtuple);
 			copy_tg = (Form_pg_trigger) GETSTRUCT(copyTuple);
 
+			/*
+			 * Set deferrability here, but note that it may be overridden below
+			 * if the pg_trigger entry is on the referencing table, depending
+			 * on the action used for ON UPDATE/DELETE. For check triggers (on
+			 * the referenced table) it is kept as is, since the ON
+			 * UPDATE/DELETE actions make no difference for the check
+			 * triggers.
+			 */
 			copy_tg->tgdeferrable = cmdcon->deferrable;
 			copy_tg->tginitdeferred = cmdcon->initdeferred;
+
+			/*
+			 * Set ON DELETE action
+			 */
+			if (tgform->tgfoid == F_RI_FKEY_NOACTION_DEL ||
+tgform->tgfoid == F_RI_FKEY_RESTRICT_DEL ||
+tgfor

Re: B-tree cache prefetches

2018-10-14 Thread Thomas Munro
On Mon, Oct 15, 2018 at 12:03 AM Andrey Borodin  wrote:
> On 31 Aug 2018, at 2:40, Thomas Munro wrote:
> A related topic is the cache-unfriendliness of traditional binary
> searches of sorted data.  Check out "Array Layouts for
> Comparison-Based Searching"[1] if you haven't already.  It says that
> if your array fits in L2 cache, as our btree pages do, then binary
> search still wins against Eytzinger et al, which I was a bit
> disappointed
>
> I've re-read that paper. Their results are not about our case, here's a quote:
> > For arrays small enough _to be kept_ in L2 cache, the branch-free binary 
> > search code listed in Listing 2 is the fastest algorithm
>
> Surely, we cannot keep all pages in L2. The Eytzinger layout could be a
> realistic possibility, except for one simple problem: I do not see how we
> can insert items into this layout.

I don't know.  Aren't we talking about binary search within one 8KB
page here, anyway?

Thinking about some other ideas for cache prefetching in btree code,
here's an idea to improve pipelining of index scans (a bit like the
two-step pipeline idea from the hash join prefetch thread).  We know
the TIDs of heap tuples we'll be looking up in the near future, so can
we prefetch all the way to the tuple?   There are three random
accesses that follow soon after reading an index tuple:
BufTableLookup(), then page header, then the tuple itself.  At the
logical extreme, we could anticipate the probe of SharedBufHash for
the TID from index tuple i + 3 (perhaps dynahash could expose a
hash_prefetch_for_hash() facility), and anticipate the read of the
page header for the tuple pointed to by index tuple i + 2 (perhaps a
dirty read of SharedBufHash that is only prepared to look at the first
entry in the bucket would be OK for this), and anticipate the read of
the tuple for the TID from index tuple i + 1 (perhaps a dirty read of
the page item table).  Or some less ambitious subset, if that pipeline
turns out to be too deep; it seems like the simplest experiment would
be to try prefetching just SharedBufHash lookups.  This may all be
completely crazy, I haven't tried any of it.
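
As a self-contained illustration of such a pipeline, here is the same idea
on a generic two-hop indirection chain; the toy arrays stand in for the
index tuples, the buffer mapping table and the heap, and none of this is
actual PostgreSQL code:

#include <stdint.h>
#include <stddef.h>

/*
 * Walk ids[i] -> slots[ids[i]] -> records[...], prefetching the first
 * hop two iterations ahead and the second hop one iteration ahead, so
 * that by the time record i is consumed its cache lines are in flight
 * or already resident.
 */
static uint64_t
sum_records(const uint32_t *ids, const uint32_t *slots,
			const uint64_t *records, size_t n)
{
	uint64_t	total = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (i + 2 < n)					/* stage 1: first hop */
			__builtin_prefetch(&slots[ids[i + 2]]);
		if (i + 1 < n)					/* stage 2: second hop */
			__builtin_prefetch(&records[slots[ids[i + 1]]]);
		total += records[slots[ids[i]]];	/* stage 3: consume */
	}
	return total;
}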

> Also, that research is quite distant from B-tree binsearch: their data
> items are small and do not represent references to actual data. Eytzinger
> exploits the fact that two possible future keys share the same cache line.
> But that is hard to guarantee, given that we place items at fairly random
> places on the page.

Based on the above quotes, I'm not suggesting we should use
Eytzinger for search-within-page.  But out of curiosity, I checked,
and saw that you can actually get a pair of index tuples holding a
six-character text key or an integer key into a 64 byte cache line.
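
For illustration, a lookup in an Eytzinger (BFS) layout goes like this; a
toy sketch, whose point is only that the two possible next comparison keys
sit in adjacent slots and so typically share a cache line:

#include <stddef.h>
#include <stdint.h>

/* Returns the slot of key in an Eytzinger-ordered array, or n if absent. */
static size_t
eytzinger_find(const int32_t *a, size_t n, int32_t key)
{
	size_t		i = 0;				/* children of i live at 2i+1 and 2i+2 */

	while (i < n)
	{
		if (a[i] == key)
			return i;
		i = 2 * i + (a[i] < key ? 2 : 1);
	}
	return n;
}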

-- 
Thomas Munro
http://www.enterprisedb.com



Re: "SELECT ... FROM DUAL" is not quite as silly as it appears

2018-10-14 Thread Tom Lane
Robert Haas  writes:
> On Thu, Mar 15, 2018 at 11:27 AM, Tom Lane  wrote:
>> We've long made fun of Oracle(TM) for the fact that if you just want
>> to evaluate some expressions, you have to write "select ... from dual"
>> rather than just "select ...".  But I've realized recently that there's
>> a bit of method in that madness after all.  Specifically, having to
>> cope with FromExprs that contain no base relation is fairly problematic
>> in the planner.  prepjointree.c is basically unable to cope with
>> flattening a subquery that looks that way, although we've inserted a
>> lot of overly-baroque logic to handle some subsets of the case (cf
>> is_simple_subquery(), around line 1500).  If memory serves, there are
>> other places that are complicated by the case.
>> 
>> Suppose that, either in the rewriter or early in the planner, we were
>> to replace such cases with nonempty FromExprs, by adding a dummy RTE
>> representing a table with no columns and one row.  This would in turn
>> give rise to an ordinary Path that converts to a Result plan, so that
>> the case is handled without any special contortions later.  Then there
>> is no case where we don't have a nonempty relids set identifying a
>> subquery, so that all that special-case hackery in prepjointree.c
>> goes away, and we can simplify whatever else is having a hard time
>> with it.

> Good idea.

Here's a proposed patch along those lines.  Some notes for review:

* The new RTE kind is "RTE_RESULT" after the kind of Plan node it will
produce.  I'm not entirely in love with that name, but couldn't think
of a better idea.

* I renamed the existing ResultPath node type to GroupResultPath to
clarify that it's not used to scan RTE_RESULT RTEs, but just for
degenerate grouping cases.  RTE_RESULT RTEs (usually) give rise to plain
Path nodes with pathtype T_Result.  It doesn't work very well to try to
unify those two cases, even though they give rise to identical Plans,
because there are different rules about where the associated quals live
during query_planner.

* The interesting changes are in prepjointree.c; almost all the rest
of the patch is boilerplate to support RTE_RESULT RTEs in mostly the
same way that other special RTE-scan plan types are handled in the
planner.  In prepjointree.c, I ended up getting rid of the original
decision to try to delete removable RTEs during pull_up_subqueries,
and instead made it happen in a separate traversal of the join tree.
That's a lot less complicated, and it has better results because we
can optimize more cases once we've seen the results of expression
preprocessing and outer-join strength reduction.

* I tried to get rid of the empty-jointree special case in query_planner
altogether.  While the patch works fine that way, it makes for a
measurable slowdown in planning trivial queries --- I saw close to 10%
degradation in TPS rate for a pgbench test case that was just 
$ cat trivialselect.sql 
select 2+2;
$ pgbench -n -T 10 -f trivialselect.sql
So I ended up putting back the special case, but it's much less of a
cheat than it was before; the RelOptInfo it hands back is basically the
same as the normal path would produce.  For me, the patch as given is
within the noise level of being the same speed as HEAD on this case.

* There's a hack in nodeResult.c to prevent the executor from crashing
on a whole-row Var for an RTE_RESULT RTE, which is something that the
planner will create in SELECT FOR UPDATE cases, because it thinks it
needs to provide a ROW_MARK_COPY image of the RTE's output.  We might
be able to get rid of that if we could teach the planner that it need
not bother rowmarking RESULT RTEs, but that seemed like it would be
really messy.  (At the point where the decision is made, we don't know
whether a subquery might end up as just a RESULT, or indeed vanish
entirely.)  Since I couldn't measure any reproducible penalty from
having the extra setup cost for a Result plan, I left it like this.

* There are several existing test cases in join.sql whose plans change
for the better with this patch, so I didn't really think we needed any
additional test cases to show that it's working.

* There are a couple of cosmetic changes in EXPLAIN output as a result
of ruleutils.c seeing more RTEs in the plan's rtable than it did before,
causing it to decide to table-qualify Var names in more cases.  We could
maybe spend more effort on ruleutils.c's heuristic for when to qualify,
but that seems like a separable problem, and anyway it's only cosmetic.

I'll add this to the next CF.

regards, tom lane

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79..efa5596 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2404,6 +2404,8 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 			case RTE_NAMEDTUPLESTORE:
 APP_JUMB_STRING(rte

Re: [RFC] Removing "magic" oids

2018-10-14 Thread Andres Freund
Hi,

On 2018-09-29 20:48:10 -0700, Andres Freund wrote:
> In my opinion the current WITH OIDs system has numerous weaknesses:
> 
> 1) The fact that oids are so magic means that if we get pluggable
>storage, the design of the potential pluggable systems is constrained
>and similar magic has to be present everywhere.
> 
> 2) The fact that the oids in each table have the same counter to be
>based on means that oid wraparounds have much worse consequences
>performance wise than necessary. E.g. once the global counter has
>wrapped, all toast tables start to be significantly slower.
> 
>It would be much better if most database objects had their own
>counters.
> 
> 3) For some oid using objects (toast, large objects at the very least)
>it'd be quite worthwhile to switch to 8 byte ids.  Currently that's
>hard to do, because it'd break on-disk compatibility.
> 
> 4) There's a lot of special case code around for dealing with oids.
> 
> 5a) The fact that system table oids don't show up in selects by default
>makes it more work than necessary to look at catalogs.
> 
> 5b) Similarly, it's fairly annoying when debugging not to trivially see
>oids for catalog structs.
> 
> 
> I think we should drop WITH OIDs support.  pg_dump should convert WITH
> OIDs tables into tables that have an explicit oid column (with an
> appropriate default function), pg_upgrade should refuse to upgrade them.
> We've defaulted WITH OIDs to off for quite a while now, so that's
> hopefully not going to be too painful.
> 
> For catalog tables, I think we should just add explicit oid columns.
> That obviously requires a fair amount of change, but it's not too bad.
> One issue here is that we want to support "manual" inserts both for
> initdb and for emergency work.
> 
> 
> The attached *PROTOTYPE* *WIP* patch removes WITH OIDs support, converts
> catalog table oids to explicit oid columns, and makes enough
> infrastructure changes to make plain pg_regress pass (after the
> necessary changes of course).  Only superficial psql and no pg_dump
> changes.
> 
> There's plenty of issues with the prototype, but overall I'm pretty
> pleased.
> 
> 
> There's three major areas I'm not so sure about:
> 
> 1) There's a few places dealing with system tables that don't deal with
>a hardcoded system table. Since there's no notion of "table has oid"
>and "which column is the oid column) anymore, we need some way to
>deal with that.  So far I've just extended existing objectaddress.c
>code to deal with that, but it's not that pretty.
> 
> 2) We need to be able to manually insert into catalog tables. Both
>initdb and emergency surgery.  My current hack is doing so directly
>in nodeModifyTable.c but that's beyond ugly.  I think we should add
>an explicit DEFAULT clause to those columns with something like
>nextoid('tablename', 'name_of_index') or such.
> 
> 3) The quickest way to deal with the bootstrap code was to just assign
>all oids for oid carrying tables that don't yet have any assigned.
>That doesn't generally seem terrible, although it's certainly badly
>implemented right now.  That'd mean we'd have three ranges of oids
>probably, unless we somehow have the bootstrap code advance the
>in-database oid counter to a right state before we start with
>post-bootstrap work.  I like the idea of all bootstrap time oids
>being determined by genbki.pl (I think that'll allow to remove some
>code too).
> 
> 
> I do wonder about ripping the oid counter out entirely, and replacing it
> with sequences. But that seems like a separate project.
> 
> 
> I'll polish this up further in the not too far away future.  But I'd
> really like to get some feedback before I sink more time into this.

Does anybody have engineering / architecture level comments about this
proposal?

Greetings,

Andres Freund



Re: "SELECT ... FROM DUAL" is not quite as silly as it appears

2018-10-14 Thread Thomas Munro
On Fri, Mar 16, 2018 at 4:27 AM Tom Lane  wrote:
> We've long made fun of Oracle(TM) for the fact that if you just want
> to evaluate some expressions, you have to write "select ... from dual"
> rather than just "select ...".  But I've realized recently that there's
> a bit of method in that madness after all.

We can still make fun of that table.  Apparently it had two rows, so
you could double rows by cross joining against it, but at some point
one of them went missing, leaving a strange name behind.  Source:
https://en.wikipedia.org/wiki/DUAL_table#History

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [RFC] Removing "magic" oids

2018-10-14 Thread Tom Lane
Andres Freund  writes:
> Does anybody have engineering / architecture level comments about this
> proposal?

FWIW, I'm -1 on making OIDs be not-magic for SELECT purposes.  Yeah, it's
a wart we wouldn't have if we designed the system today, but the wart is
thirty years old.  I think changing that will break so many catalog
queries that we'll have the villagers on the doorstep.  Most of the other
things you're suggesting here could be done easily without making that
change.

Possibly we could make them not-magic from the storage standpoint (ie
they're regular columns) but have a pg_attribute flag that says not
to include them in "SELECT *" expansion.

BTW, the fact that we have magic OIDs in the system catalogs doesn't
mean that any other storage system has to support that.  (When I was
at Salesforce, we endured *substantial* pain from insisting that we
move the catalogs into their custom storage system.  There were good
reasons to do so, but it's not a decision I'd make again if there were
any plausible way not to.)

Personally, I'd drop WITH OIDs support for user tables altogether, rather
than having pg_dump create a "compatible" translation that won't be very
compatible if it loses the magicness aspect.

regards, tom lane



Re: pgsql: Add TAP tests for pg_verify_checksums

2018-10-14 Thread Michael Paquier
On Sun, Oct 14, 2018 at 10:24:55AM -0400, Andrew Dunstan wrote:
> [snip]

Thanks a lot for the review, Andrew!

> This code now seems redundant:
> 
>  if (strcmp(fn, ".") == 0 ||
>      strcmp(fn, "..") == 0)
>      return true;

Indeed.  I have renamed skipfile() to isRelFileName on the way, which
looks better in the context.

> I would probably reverse the order of these two tests. It might not make any
> difference, assuming fn is never an empty string, but it seems more logical
> to me.
> 
>+    /* good to go if only digits */
>+    if (fn[pos] == '\0')
>+        return false;
>+    /* leave if no digits */
>+    if (pos == 0)
>+        return true;

No objections to that.  Done.

> It also looks to me like the check for a segment number doesn't ensure
> there is at least one digit, so "123." might pass, but I could be
> wrong. In any case, there isn't a test for that, and there probably
> should be.

You are right here.  This name pattern has been failing.  Fixed in the
attached.  I have added "123_" and "123_." as extra patterns to check.

What do you think about the updated version attached?
--
Michael
diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 1bc020ab6c..666ab3f21e 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -15,6 +15,7 @@
 
 #include "catalog/pg_control.h"
 #include "common/controldata_utils.h"
+#include "common/relpath.h"
 #include "getopt_long.h"
 #include "pg_getopt.h"
 #include "storage/bufpage.h"
@@ -49,27 +50,69 @@ usage(void)
 	printf(_("Report bugs to .\n"));
 }
 
-static const char *const skip[] = {
-	"pg_control",
-	"pg_filenode.map",
-	"pg_internal.init",
-	"PG_VERSION",
-	NULL,
-};
-
+/*
+ * isRelFileName
+ *
+ * Check if the given file name has a shape authorized for scanning.
+ */
 static bool
-skipfile(const char *fn)
+isRelFileName(const char *fn)
 {
-	const char *const *f;
+	int			pos;
 
-	if (strcmp(fn, ".") == 0 ||
-		strcmp(fn, "..") == 0)
+	/*----------
+	 * Only files including data checksums are authorized for scanning.  This
+	 * is guessed based on the file name, by reverse-engineering
+	 * GetRelationPath so make sure to update both code paths if any updates
+	 * are done. The following file names are allowed:
+	 * <digits>
+	 * <digits>.<segment>
+	 * <digits>_<forkname>
+	 * <digits>_<forkname>.<segment>
+	 *
+	 * Note that temporary files, beginning with 't', are also skipped.
+	 *
+	 *----------
+	 */
+
+	/* A non-empty string of digits should follow */
+	for (pos = 0; isdigit((unsigned char) fn[pos]); ++pos)
+		;
+	/* leave if no digits */
+	if (pos == 0)
+		return false;
+	/* good to go if only digits */
+	if (fn[pos] == '\0')
 		return true;
 
-	for (f = skip; *f; f++)
-		if (strcmp(*f, fn) == 0)
-			return true;
-	return false;
+	/* Authorized fork files can be scanned */
+	if (fn[pos] == '_')
+	{
+		int			forkchar = forkname_chars(&fn[pos + 1], NULL);
+
+		if (forkchar <= 0)
+			return false;
+
+		pos += forkchar + 1;
+	}
+
+	/* Check for an optional segment number */
+	if (fn[pos] == '.')
+	{
+		int			segchar;
+
+		for (segchar = 1; isdigit((unsigned char) fn[pos + segchar]); ++segchar)
+			;
+
+		if (segchar <= 1)
+			return false;
+		pos += segchar;
+	}
+
+	/* Now this should be the end */
+	if (fn[pos] != '\0')
+		return false;
+	return true;
 }
 
 static void
@@ -146,7 +189,7 @@ scan_directory(const char *basedir, const char *subdir)
 		char		fn[MAXPGPATH];
 		struct stat st;
 
-		if (skipfile(de->d_name))
+		if (!isRelFileName(de->d_name))
 			continue;
 
 		snprintf(fn, sizeof(fn), "%s/%s", path, de->d_name);
diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl
index dc29da09af..8442534e95 100644
--- a/src/bin/pg_verify_checksums/t/002_actions.pl
+++ b/src/bin/pg_verify_checksums/t/002_actions.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 12;
+use Test::More tests => 36;
 
 # Initialize node with checksums enabled.
 my $node = get_new_node('node_checksum');
@@ -17,6 +17,31 @@ command_like(['pg_controldata', $pgdata],
 	 qr/Data page checksum version:.*1/,
 		 'checksums enabled in control file');
 
+# Add set of dummy files with some contents.  These should not be scanned
+# by the tool.
+append_to_file "$pgdata/global/123.", "foo";
+append_to_file "$pgdata/global/123_", "foo";
+append_to_file "$pgdata/global/123_.", "foo";
+append_to_file "$pgdata/global/123.12t", "foo";
+append_to_file "$pgdata/global/foo", "foo2";
+append_to_file "$pgdata/global/t123", "bar";
+append_to_file "$pgdata/global/123a", "bar2";
+append_to_file "$pgdata/global/.123", "foobar";
+append_to_file "$pgdata/global/_fsm", "foobar2";
+append_to_file "$pgdata/global/_init", "foobar3";
+append_to_file "$pgdata/global/_vm.123", "foohoge";
+append_to_file "$pgdata/global/123_vm.123t", "foohoge2";
+
+# Those are correct but empty files, so they 

Re: [RFC] Removing "magic" oids

2018-10-14 Thread Andres Freund
Hi,

On 2018-10-14 18:34:50 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Does anybody have engineering / architecture level comments about this
> > proposal?
> 
> FWIW, I'm -1 on making OIDs be not-magic for SELECT purposes.  Yeah, it's
> a wart we wouldn't have if we designed the system today, but the wart is
> thirty years old.  I think changing that will break so many catalog
> queries that we'll have the villagers on the doorstep.  Most of the other
> things you're suggesting here could be done easily without making that
> change.

Yea, I agree that it'll cause some pain. And we could easily 'de-magic'
oids everywhere except the SELECT * processing (that'd probably leave
some behavioural difference with composite types, but there it'd
probably be purely better).


I'm not sure we want that however - yes, the short-term pain will be
lower, but do we really want to inflict the confusion about invisible
oids on our users for the next 20 years? I think there's a fair argument
to be made that we should cause pain once, rather than continuing to
inflict lower doses of pain.


> Possibly we could make them not-magic from the storage standpoint (ie
> they're regular columns) but have a pg_attribute flag that says not
> to include them in "SELECT *" expansion.

Right. And we could just use that for system columns too, removing a
number of special case logic. Seems not unlikely that we'll have further
magic columns that want to be hidden by default, given Kevin is
pondering re-starting his work on incremental matviews.


> BTW, the fact that we have magic OIDs in the system catalogs doesn't
> mean that any other storage system has to support that.  (When I was
> at Salesforce, we endured *substantial* pain from insisting that we
> move the catalogs into their custom storage system.  There were good
> reasons to do so, but it's not a decision I'd make again if there were
> any plausible way not to.)

Right. I suspect that we at some point want to support having catalog
tables in different storage formats, but certainly not initially. Part
of the reason why I want to remove WITH OIDs support is precisely that I
do not want to add this kind of magic to other storage formats.


> Personally, I'd drop WITH OIDs support for user tables altogether, rather
> than having pg_dump create a "compatible" translation that won't be very
> compatible if it loses the magicness aspect.

Yea, I'm on-board with that.

Greetings,

Andres Freund



Re: [RFC] Removing "magic" oids

2018-10-14 Thread Thomas Munro
On Mon, Oct 15, 2018 at 11:35 AM Tom Lane  wrote:
> Andres Freund  writes:
> > Does anybody have engineering / architecture level comments about this
> > proposal?
>
> FWIW, I'm -1 on making OIDs be not-magic for SELECT purposes.  Yeah, it's
> a wart we wouldn't have if we designed the system today, but the wart is
> thirty years old.  I think changing that will break so many catalog
> queries that we'll have the villagers on the doorstep.  Most of the other
> things you're suggesting here could be done easily without making that
> change.
>
> Possibly we could make them not-magic from the storage standpoint (ie
> they're regular columns) but have a pg_attribute flag that says not
> to include them in "SELECT *" expansion.

FWIW there is interest in a general facility for hiding arbitrary
attributes from SELECT * for other reasons too:

https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pgsql: Add TAP tests for pg_verify_checksums

2018-10-14 Thread Andrew Dunstan




On 10/14/2018 06:41 PM, Michael Paquier wrote:

On Sun, Oct 14, 2018 at 10:24:55AM -0400, Andrew Dunstan wrote:

[snip]

Thanks a lot for the review, Andrew!


This code now seems redundant:

  if (strcmp(fn, ".") == 0 ||
      strcmp(fn, "..") == 0)
      return true;

Indeed.  I have renamed skipfile() to isRelFileName on the way, which
looks better in the context.


I would probably reverse the order of these two tests. It might not make any
difference, assuming fn is never an empty string, but it seems more logical
to me.

+    /* good to go if only digits */
+    if (fn[pos] == '\0')
+        return false;
+    /* leave if no digits */
+    if (pos == 0)
+        return true;

No objections to that.  Done.


It also looks to me like the check for a segment number doesn't ensure
there is at least one digit, so "123." might pass, but I could be
wrong. In any case, there isn't a test for that, and there probably
should be.

You are right here.  This name pattern has been failing.  Fixed in the
attached.  I have added "123_" and "123_." as extra patterns to check.

What do you think about the updated version attached?



Looks good to me.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgbench - add pseudo-random permutation function

2018-10-14 Thread Hironobu SUZUKI

Hi,

I reviewed pgbench-prp-func-6.patch.

(1) modular_multiply()
In modular_multiply(), the number of digits of the inputs is checked in
order to determine whether or not to use the interleaved modular
multiplication algorithm. However, the check condition depends entirely
on the implementation of pseudorandom_perm(), even though
modular_multiply() is a general function.

I think it would be better for pseudorandom_perm() to check the
condition directly, because then it can work efficiently and
modular_multiply() can be used as a general-purpose function.
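
For readers unfamiliar with the technique, interleaved modular
multiplication computes (a * b) % m without any 128-bit intermediate by
reducing after every doubling step, roughly as follows; a generic
shift-and-add sketch that assumes m < 2^63, not the patch's exact code:

#include <stdint.h>

static uint64_t
modular_multiply(uint64_t a, uint64_t b, uint64_t m)
{
	uint64_t	result = 0;

	a %= m;
	while (b > 0)
	{
		if (b & 1)
			result = (result + a) % m;	/* add this power-of-two addend */
		a = (a + a) % m;				/* double the addend, reduced mod m */
		b >>= 1;
	}
	return result;
}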


(2) pseudorandom_perm() and 001_pgbench_with_server.pl
Reading the discussion of the __builtin_xxx functions, I remembered
another overflow issue.

pseudorandom_perm() uses Donald Knuth's linear congruential generator
method to obtain pseudo-random numbers. This method, like all linear
congruential generator methods, always overflows.

If we did not need to guarantee the portability of this code, we would
not care about the result of this method, because we just use *pseudo*
random numbers. (In fact, I ignored it before.) However, since we have
to guarantee portability, we have to calculate it exactly. I therefore
implemented the function dk_lcg() and rewrote pseudorandom_perm().
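
For comparison, the C side of this step needs no special handling, since
unsigned arithmetic in C is defined to wrap modulo 2^64; a sketch using
the DK_LCG_MUL/DK_LCG_INC constants from the script below:

#include <stdint.h>

#define DK_LCG_MUL	UINT64_C(6364136223846793005)
#define DK_LCG_INC	UINT64_C(1442695040888963407)

/* One step of Knuth's 64-bit LCG; the multiply wraps mod 2^64 by definition. */
static uint64_t
dk_lcg(uint64_t key)
{
	return key * DK_LCG_MUL + DK_LCG_INC;
}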


Just to be sure, I wrote a Python script to check the algorithm of
pseudorandom_perm() and ran it.
Fortunately, my Python script and the rewritten pseudorandom_perm()
returned the same answers, so I updated the expected answers in
001_pgbench_with_server.pl.




I attached the new patch `pgbench-prp-func-7.patch`, the Python script
`pseudorandom_perm.py`, and the pr_perm check script file
`pr_perm_check.sql`.


Best regards,


On 2018/10/01 12:19, Fabien COELHO wrote:


On Wed, Sep 26, 2018 at 01:20:49PM +0200, Fabien COELHO wrote:
So, am I right in deducing that you are satisfied with the current status
of the patch, with the nbits implementation either based on popcount (v4)
or clz (v5) compiler intrinsics? I think that the clz option is better.


Fabien, please note that v5 does not apply,


Indeed, tests interact with 92a0342 committed on Thursday.

Here is a rebase of the latest version based on CLZ. According to a basic
test I made, the underlying hardware instruction seems to be more often
available.



so I moved this patch to next CF, waiting on author.


I'm going to change its state to "Needs review".




\set a1 pr_perm(0, 1000, 54321)
\set a2 pr_perm(1999, 1000, 54321)
\set a3 pr_perm(2500, 1000, 54321)
BEGIN;
DROP TABLE IF EXISTS test;
CREATE TABLE test (id int, data bigint);
INSERT INTO test VALUES(1, :a1);
INSERT INTO test VALUES(2, :a2);
INSERT INTO test VALUES(3, :a3);
END;
   

#!/usr/bin/python

PRP_ROUNDS = 4
DK_LCG_MUL = 6364136223846793005
DK_LCG_INC = 1442695040888963407

LCG_SHIFT = 13

PRP_PRIMES = 16
primes = [
8388617, 8912921, 9437189, 9961487, 10485767, 11010059, 11534351, 12058679,
12582917, 13107229, 13631489, 14155777, 14680067, 15204391, 15728681, 16252967
]

def compute_mask(n):
    n = n | (n >> 1)
    n = n | (n >> 2)
    n = n | (n >> 4)
    n = n | (n >> 8)
    n = n | (n >> 16)
    n = n | (n >> 32)
    return n

def dk_lcg(key):
    # reduce modulo 2^64 to emulate C uint64 overflow behavior
    ret = (key * DK_LCG_MUL + DK_LCG_INC) % 0x10000000000000000
    return ret

def pseudorandom_perm(data, size, seed):
    v = data % size
    key = seed
    mask = (compute_mask(size - 1) >> 1)

    if (size == 1):
        return 0

    p = key % PRP_PRIMES
    for i in range(0, PRP_ROUNDS):
        key = dk_lcg(key)
        if (v <= mask):
            v ^= ((key >> LCG_SHIFT) & mask)

        key = dk_lcg(key)
        t = size - 1 - v
        if (t <= mask):
            t ^= (key >> LCG_SHIFT) & mask
        v = size - 1 - t

        while (size % primes[p] == 0):
            p = (p + 1) % PRP_PRIMES

        key = dk_lcg(key)
        v = ((primes[p] * v) + (key >> LCG_SHIFT)) % size

        p = (p + 1) % PRP_PRIMES

    return v

def print_ret(v, size, seed, ret):
    s = 'pr_perm(' + str(v) + ', ' + str(size) + ', ' + str(seed) + ') = ' + str(ret)
    print s

## main
for i in range(0, 11):
    r = pseudorandom_perm(i, 11, 5432)
    print_ret(i, 11, 5432, r)

q = [0, 1999, 2500]

for i in q:
    r = pseudorandom_perm(i, 1000, 54321)
    print_ret(i, 1000, 54321, r)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index af2dea1..5b09f73 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -327,6 +327,26 @@ fi])# PGAC_C_BUILTIN_BSWAP64
 
 
 
+# PGAC_C_BUILTIN_CLZLL
+# --------------------
+# Check if the C compiler understands __builtin_clzll(),
+# and define HAVE__BUILTIN_CLZLL if so.
+# Both GCC & CLANG seem to have one.
+AC_DEFUN([PGAC_C_BUILTIN_CLZLL],
+[AC_CACHE_CHECK(for __builtin_clzll, pgac_cv__builtin_clzll,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static unsigned long

Re: "SELECT ... FROM DUAL" is not quite as silly as it appears

2018-10-14 Thread Tom Lane
I wrote:
> * There's a hack in nodeResult.c to prevent the executor from crashing
> on a whole-row Var for an RTE_RESULT RTE, which is something that the
> planner will create in SELECT FOR UPDATE cases, because it thinks it
> needs to provide a ROW_MARK_COPY image of the RTE's output.  We might
> be able to get rid of that if we could teach the planner that it need
> not bother rowmarking RESULT RTEs, but that seemed like it would be
> really messy.  (At the point where the decision is made, we don't know
> whether a subquery might end up as just a RESULT, or indeed vanish
> entirely.)  Since I couldn't measure any reproducible penalty from
> having the extra setup cost for a Result plan, I left it like this.

Well, I'd barely sent this when I realized that there was a better way.
The nodeResult.c hack predates my decision to postpone cleaning up
RTE_RESULT RTEs till near the end of the preprocessing phase, and
given that code, it is easy to get rid of rowmarking RESULT RTEs ...
in fact, the code was doing it already, except in the edge case of
a SELECT with only a RESULT RTE.  So here's a version that does not
touch nodeResult.c.

regards, tom lane

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79..efa5596 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2404,6 +2404,8 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 			case RTE_NAMEDTUPLESTORE:
 APP_JUMB_STRING(rte->enrname);
 break;
+			case RTE_RESULT:
+break;
 			default:
 elog(ERROR, "unrecognized RTE kind: %d", (int) rte->rtekind);
 break;
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 21a2ef5..ac3722a 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5374,7 +5374,7 @@ INSERT INTO ft2 (c1,c2,c3) VALUES (1200,999,'foo') RETURNING tableoid::regclass;
QUERY PLAN
 ---------------------------------------------------------------------------
  Insert on public.ft2
-   Output: (tableoid)::regclass
+   Output: (ft2.tableoid)::regclass
Remote SQL: INSERT INTO "S 1"."T 1"("C 1", c2, c3, c4, c5, c6, c7, c8) VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
->  Result
  Output: 1200, 999, NULL::integer, 'foo'::text, NULL::timestamp with time zone, NULL::timestamp without time zone, NULL::character varying, 'ft2   '::character(10), NULL::user_enum
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421..8f7d4ba 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -437,9 +437,12 @@ ExecSupportsMarkRestore(Path *pathnode)
 return ExecSupportsMarkRestore(((ProjectionPath *) pathnode)->subpath);
 			else if (IsA(pathnode, MinMaxAggPath))
 return false;	/* childless Result */
+			else if (IsA(pathnode, GroupResultPath))
+return false;	/* childless Result */
 			else
 			{
-Assert(IsA(pathnode, ResultPath));
+/* Simple RTE_RESULT base relation */
+Assert(IsA(pathnode, Path));
 return false;	/* childless Result */
 			}
 
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index a10014f..ecaeeb3 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -2319,10 +2319,6 @@ range_table_walker(List *rtable,
 if (walker(rte->tablesample, context))
 	return true;
 break;
-			case RTE_CTE:
-			case RTE_NAMEDTUPLESTORE:
-/* nothing to do */
-break;
 			case RTE_SUBQUERY:
 if (!(flags & QTW_IGNORE_RT_SUBQUERIES))
 	if (walker(rte->subquery, context))
@@ -2345,6 +2341,11 @@ range_table_walker(List *rtable,
 if (walker(rte->values_lists, context))
 	return true;
 break;
+			case RTE_CTE:
+			case RTE_NAMEDTUPLESTORE:
+			case RTE_RESULT:
+/* nothing to do */
+break;
 		}
 
 		if (walker(rte->securityQuals, context))
@@ -3150,10 +3151,6 @@ range_table_mutator(List *rtable,
 	   TableSampleClause *);
 /* we don't bother to copy eref, aliases, etc; OK? */
 break;
-			case RTE_CTE:
-			case RTE_NAMEDTUPLESTORE:
-/* nothing to do */
-break;
 			case RTE_SUBQUERY:
 if (!(flags & QTW_IGNORE_RT_SUBQUERIES))
 {
@@ -3184,6 +3181,11 @@ range_table_mutator(List *rtable,
 			case RTE_VALUES:
 MUTATE(newrte->values_lists, rte->values_lists, List *);
 break;
+			case RTE_CTE:
+			case RTE_NAMEDTUPLESTORE:
+			case RTE_RESULT:
+/* nothing to do */
+break;
 		}
 		MUTATE(newrte->securityQuals, rte->securityQuals, List *);
 		newrt 

Re: WIP: Avoid creation of the free space map for small tables

2018-10-14 Thread Amit Kapila
On Sun, Oct 14, 2018 at 1:09 AM John Naylor  wrote:
>
> On 10/13/18, Amit Kapila  wrote:
>
> > I think you have found a good way to avoid creating FSM, but can't we
> > use some simpler technique like if the FSM fork for a relation doesn't
> > exist, then check the heapblk number for which we try to update the
> > FSM and if it is less than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
> > creating the FSM.
>
> I think I see what you mean, but to avoid the vacuum problem you just
> mentioned, we'd need to check the relation size, too.
>

Sure, but vacuum already has relation size.  In general, I think we
should try to avoid adding more system calls in the common code path.
It can impact the performance.

A few comments on your latest patch:
-
+static bool
+allow_write_to_fsm(Relation rel, BlockNumber heapBlk)
+{
+ BlockNumber heap_nblocks;
+
+ if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD ||
+ rel->rd_rel->relkind != RELKIND_RELATION)
+ return true;
+
+ /* XXX is this value cached? */
+ heap_nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD)
+ return true;
+ else
+ {
+ RelationOpenSmgr(rel);
+ return smgrexists(rel->rd_smgr, FSM_FORKNUM);
+ }
+}

I think you can avoid calling RelationGetNumberOfBlocks if you call
smgrexists first, and for the purpose of vacuum, we can get that as an
input parameter.  I think one can argue for not changing the interface
functions like RecordPageWithFreeSpace to avoid calling
RelationGetNumberOfBlocks, but to me, it appears worth it to save the
additional system call.

-
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
-
- /*
- * If the FSM knows nothing of the rel, try the last page before we
- * give up and extend.  This avoids one-tuple-per-page syndrome during
- * bootstrapping or in a recently-started system.
- */
  if (targetBlock == InvalidBlockNumber)
- {
- BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
-
- if (nblocks > 0)
- targetBlock = nblocks - 1;
- }
+ targetBlock = get_page_no_fsm(relation, InvalidBlockNumber,
+   &try_every_page);


Is it possible to hide the magic of trying each block within
GetPageWithFreeSpace?  It would simplify the code, and in the future, if
another storage API has a different function for
RelationGetBufferForTuple, it will work seamlessly, provided they are
using the same FSM.  One such user is zheap.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: B-tree cache prefetches

2018-10-14 Thread Andrey Borodin



> On 15 Oct 2018, at 2:38, Thomas Munro wrote:
> 
> On Mon, Oct 15, 2018 at 12:03 AM Andrey Borodin  wrote:
>> On 31 Aug 2018, at 2:40, Thomas Munro wrote:
>> A related topic is the cache-unfriendliness of traditional binary
>> searches of sorted data.  Check out "Array Layouts for
>> Comparison-Based Searching"[1] if you haven't already.  It says that
>> if your array fits in L2 cache, as our btree pages do, then binary
>> search still wins against Eytzinger et al, which I was a bit
>> disappointed by
>> 
>> I've re-read that paper. Their results are not about our case, here's a 
>> quote:
>>> For arrays small enough _to be kept_ in L2 cache, the branch-free binary 
>>> search code listed in Listing 2 is the fastest algorithm
>> 
>> Surely, we cannot keep all pages in L2. An Eytzinger layout could be a
>> realistic possibility, except for one simple problem: I do not see how we
>> can insert items into this layout.
> 
> I don't know.  Aren't we talking about binary search within one 8KB
> page here, anyway?
The paper discusses multiple searches inside one 8KB region, while we are
doing a single search in a completely uncached page and then moving on.
That's the difference.

Binary search is optimal if the page is in L1 before we start the search.
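
(For what it's worth, the first few probe offsets of a binary search are
known before any comparison happens -- n/2, n/4, 3n/4 -- so on a cold page
one could at least issue prefetches for those up front. A toy illustration,
not PostgreSQL code:

#include <stddef.h>

static inline void
prefetch_first_probes(const void *base, size_t elem_size, size_t nelems)
{
	/* These offsets depend only on nelems, not on the search key. */
	__builtin_prefetch((const char *) base + elem_size * (nelems / 2), 0, 1);
	__builtin_prefetch((const char *) base + elem_size * (nelems / 4), 0, 1);
	__builtin_prefetch((const char *) base + elem_size * (3 * nelems / 4), 0, 1);
}

Whether that wins anything in practice would need measuring.)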

> 
> Thinking about some other ideas for cache prefetching in btree code,
> here's an idea to improve pipelining of index scans (a bit like the
> two-step pipeline idea from the hash join prefetch thread).  We know
> the TIDs of heap tuples we'll be looking up in the near future, so can
> we prefetch all the way to the tuple?   There are three random
> accesses that follow soon after reading an index tuple:
> BufTableLookup(), then page header, then the tuple itself.  At the
> logical extreme, we could anticipate the probe of SharedBufHash for
> the TID from index tuple i + 3 (perhaps dynahash could expose a
> hash_prefetch_for_hash() facility), and anticipate the read of the
> page header for the tuple pointed to by index tuple i + 2 (perhaps a
> dirty read of SharedBufHash that is only prepared to look at the first
> entry in the bucket would be OK for this), and anticipate the read of
> the tuple for the TID from index tuple i + 1 (perhaps a dirty read of
> the page item table).  Or some less ambitious subset, if that pipeline
> turns out to be too deep; it seems like the simplest experiment would
> be to try prefetching just SharedBufHash lookups.  This may all be
> completely crazy, I haven't tried any of it.
This idea also looks good to me. One thing bothers me: if we do
__builtin_prefetch(&a[b[c[d]]], 0, 1), and a, b, c and d are all not in
cache, will we incur CPU wait time for fetching a, b and c?
These folks [0] are exploiting C++ coroutines for prefetching this kind of
stuff, but it seems like too much hassle.
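
As far as I understand, computing the address &a[b[c[d]]] itself requires
demand loads of d, c[d] and b[c[d]], so those do stall as usual; only the
final a[...] line is fetched asynchronously. The staged pipeline avoids
exactly that, because every prefetch address depends only on data loaded a
few iterations earlier. A compilable toy sketch (the three arrays merely
stand in for buffer-mapping bucket, page header and tuple; this is not
executor code):

#include <stddef.h>

typedef struct { char pad[64]; } CacheLine;	/* stand-in payloads */

static void
pipelined_lookups(CacheLine *buckets, CacheLine *pages, CacheLine *tuples,
				  const unsigned *tids, size_t ntids)
{
	for (size_t i = 0; i < ntids; i++)
	{
		/* Issue prefetches for future iterations; each address uses only
		 * tids[], which is already cached, so nothing here stalls. */
		if (i + 3 < ntids)
			__builtin_prefetch(&buckets[tids[i + 3]], 0, 1);
		if (i + 2 < ntids)
			__builtin_prefetch(&pages[tids[i + 2]], 0, 1);
		if (i + 1 < ntids)
			__builtin_prefetch(&tuples[tids[i + 1]], 0, 1);

		/* ... do the real work for tids[i] here; in the real thing each
		 * later stage would use the value an earlier prefetch brought in. */
	}
}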
> 
>> Also, that research is quite distant from B-tree binsearch: their data
>> items are small and do not represent references to actual data. Eytzinger
>> exploits the fact that two possible future keys share the same cache line.
>> But that is hard to guarantee, given we place items at quite random places
>> on the page.
> 
> Based on the above quotes, I'm not suggesting we should use
> Eytzinger for search-within-page.  But out of curiosity, I checked,
> and saw that you can actually get a pair of index tuples holding a
> six-character text key or an integer key into a 64-byte cache line.

Well, this proves that in theory Eytzinger is an opportunity.
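
The back-of-envelope arithmetic, assuming MAXALIGN is 8 as on common 64-bit
platforms: an index tuple is a 6-byte ItemPointerData plus a 2-byte t_info,
then the key, all rounded up to MAXALIGN. A quick self-check:

#include <assert.h>

#define MAXALIGN8(x)	(((x) + 7) & ~7)

int
main(void)
{
	int		header = 6 + 2;							/* ItemPointerData + t_info */
	int		int4_itup = MAXALIGN8(header + 4);		/* = 16 bytes */
	int		text6_itup = MAXALIGN8(header + 1 + 6);	/* short varlena: = 16 bytes */

	assert(2 * int4_itup <= 64);	/* a pair fits in one cache line */
	assert(2 * text6_itup <= 64);
	return 0;
}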

Best regards, Andrey Borodin.

[0] http://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf


Re: PostgreSQL 11 RC1 + GA Dates

2018-10-14 Thread Sandeep Thakkar
Hi,

On Tue, Oct 2, 2018 at 7:28 PM Jonathan S. Katz wrote:

> Hi,
>
> Based on the current status of the open items and where we are at in the
> release cycle, the date for the first release candidate of PostgreSQL 11
> will be 2018-10-11.
>
> If all goes well with RC1, the PostgreSQL 11.0 GA release will be
> 2018-10-18. This is subject to change if we find any issues during the
> RC1 period that indicate we need to make an additional release candidate
> prior to going GA.
>
So, is the 11.0 GA release planned for this week?


> To the entire community, thank you for all of your hard work on
> developing PostgreSQL 11. It is exciting that we are finally at this
> point where we are ready to make our major release.
>
> Jonathan
>
>

-- 
Sandeep Thakkar


Re: TupleTableSlot abstraction

2018-10-14 Thread Amit Khandekar
On Sat, 13 Oct 2018 at 04:02, Andres Freund  wrote:
> > > > @@ -2024,7 +2024,18 @@ FormIndexDatum(IndexInfo *indexInfo,
> > > >  		Datum		iDatum;
> > > >  		bool		isNull;
> > > >
> > > > -		if (keycol != 0)
> > > > +		if (keycol < 0)
> > > > +		{
> > > > +			HeapTupleTableSlot *hslot = (HeapTupleTableSlot *) slot;
> > > > +
> > > > +			/* Only heap tuples have system attributes. */
> > > > +			Assert(TTS_IS_HEAPTUPLE(slot) || TTS_IS_BUFFERTUPLE(slot));
> > > > +
> > > > +			iDatum = heap_getsysattr(hslot->tuple, keycol,
> > > > +									 slot->tts_tupleDescriptor,
> > > > +									 &isNull);
> > > > +		}
> > > > +		else if (keycol != 0)
> > > >  		{
> > > >  			/*
> > > >  			 * Plain index column; get the value we need directly from the
> > >
> > > This now should access the system column via the slot, right?  There are
> > > other places like this IIRC.
> >
> > Done. In FormIndexDatum() and ExecInterpExpr(), directly calling
> > slot_getsysattr() now.
> >
> > In ExecInterpExpr(), I am no longer using ExecEvalSysVar(). I am
> > planning to remove this definition since it would be a single-line
> > function just calling slot_getsysattr().
> >
> > In build_ExecEvalSysVar(), ExecEvalSysVar() is still used, so I
> > haven't removed the definition yet. I am planning to create a new
> > LLVMValueRef FuncSlotGetsysattr and use that instead in
> > build_ExecEvalSysVar(), or, for that matter, I am thinking of reverting
> > build_ExecEvalSysVar() and instead having that code inline as in
> > HEAD.
>
> I'd just have ExecInterpExpr() continue calling ExecEvalSysVar. There's
> no reason for it to be inline.
Can you explain more about why you think there should be an
ExecEvalSysVar() definition? As I mentioned earlier, it would just call
slot_getsysattr() and do nothing else.
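
For reference, my understanding is that the retained wrapper would be
roughly this thin (sketched from the patch; the exact slot_getsysattr()
signature may differ):

void
ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
			   TupleTableSlot *slot)
{
	bool		isnull;

	/* Fetch the system attribute through the slot abstraction. */
	*op->resvalue = slot_getsysattr(slot, op->d.var.attnum, &isnull);
	*op->resnull = isnull;
}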

> And it's simpler for JIT than the alternative.

You mean it would be simpler for JIT to call ExecEvalSysVar() than
slot_getsysattr()? I didn't get why it is simpler.

Or are you talking about build_ExecEvalSysVar()? I am ok with
retaining build_ExecEvalSysVar(), but I was saying that even inside this
function, we could do:
LLVMBuildCall(..., llvm_get_decl(mod, FuncSlotGetsysattr), ...)
rather than:
LLVMFunctionType(...)
LLVMAddFunction("ExecEvalSysVar", ...)
LLVMBuildCall(...)
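
Spelled out a bit more (a hypothetical sketch; FuncSlotGetsysattr would
need to be declared alongside the other pre-registered functions in
llvmjit.c, and the argument list assumes slot_getsysattr(slot, attnum,
&isnull)):

static LLVMValueRef
build_SlotGetsysattr_call(LLVMBuilderRef b, LLVMModuleRef mod,
						  LLVMValueRef v_slot, LLVMValueRef v_attnum,
						  LLVMValueRef v_isnullp)
{
	LLVMValueRef fn = llvm_get_decl(mod, FuncSlotGetsysattr);
	LLVMValueRef args[3];

	args[0] = v_slot;
	args[1] = v_attnum;
	args[2] = v_isnullp;

	/* One call instruction; no per-module LLVMAddFunction() needed. */
	return LLVMBuildCall(b, fn, args, 3, "");
}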


>
> > > > @@ -185,6 +1020,7 @@ ExecResetTupleTable(List *tupleTable,	/* tuple table */
> > > >  	{
> > > >  		TupleTableSlot *slot = lfirst_node(TupleTableSlot, lc);
> > > >
> > > > +		slot->tts_cb->release(slot);
> > > >  		/* Always release resources and reset the slot to empty */
> > > >  		ExecClearTuple(slot);
> > > >  		if (slot->tts_tupleDescriptor)
> > > > @@ -240,6 +1076,7 @@ void
> > > >  ExecDropSingleTupleTableSlot(TupleTableSlot *slot)
> > > >  {
> > > >  	/* This should match ExecResetTupleTable's processing of one slot */
> > > > +	slot->tts_cb->release(slot);
> > > >  	Assert(IsA(slot, TupleTableSlot));
> > > >  	ExecClearTuple(slot);
> > > >  	if (slot->tts_tupleDescriptor)
> > >
> > > ISTM that release should be called *after* clearing the slot.
> >
> > I am copying here what I discussed about this in the earlier reply:
> >
> > I am not sure what release() was designed to do. Currently all of the
> > implementers of this function are empty.
>
> So additional deallocations can happen in a slot. We might need this
> e.g. at some point for zheap, which needs larger, longer-lived buffers
> in slots.
>
> > Was it meant for doing
> > ReleaseTupleDesc(slot->tts_tupleDescriptor) ? Or
> > ReleaseBuffer(bslot->buffer) ?
>
> No. The former lives in generic code, the latter is in ClearTuple.
>
> > I think the reason for calling this *before* clearing the tuple might
> > be that clear() could have already freed some handles that release()
> > needs.
>
> It's just plainly wrong to call it this way round.

Ok.

>
>
> > I went ahead and did these changes, but for now, I haven't replaced
> > ExecFetchSlotTuple() with ExecFetchSlotHeapTuple(). Instead, I
> > retained ExecFetchSlotTuple() to be called for heap tuples, and added
> > a new ExecFetchGenericSlotTuple() to be used with shouldFree. I don't
> > like these names, but until we have reached a conclusion, I don't want
> > to go ahead and replace all the numerous ExecFetchSlotTuple() calls
> > with ExecFetchSlotHeapTuple().
>
> Why not?

I haven't gone ahead because I wanted to know if you are ok with the names.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company