Re: Experimenting with hash join prefetch
Hi, Thomas!

> On 14 Oct 2018, at 9:18, Thomas Munro wrote:
>
> + /* Prefetch the bucket for the next key */
> + uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
> + uint32 next_bucket = next_hash % hashtable->nbuckets;
> + __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);

+1, I also think that we should be using __builtin_prefetch these days (years, actually). It was right after listening to the talk by Anastassia Ailamaki (author of the referenced paper) at VLDB that I proposed doing the same for B-tree [0], but I did not really pursue the idea.

[0] https://www.postgresql.org/message-id/3B774C9E-01E8-46A7-9642-7830DC1108F1%40yandex-team.ru
Re: B-tree cache prefetches
Hello!

> On 31 Aug 2018, at 2:40, Thomas Munro wrote:
> A related topic is the cache-unfriendliness of traditional binary
> searches of sorted data. Check out "Array Layouts for
> Comparison-Based Searching"[1] if you haven't already. It says that
> if your array fits in L2 cache, as our btree pages do, then binary
> search still wins against Eytzinger et al, which I was a bit
> disappointed

I've re-read that paper. Their results are not about our case; here's a quote:

> For arrays small enough _to be kept_ in L2 cache, the branch-free binary
> search code listed in Listing 2 is the fastest algorithm

Surely, we cannot keep all pages in L2. The Eytzinger layout could be a realistic possibility, except for one simple problem: I do not see how we can insert items into this layout.

Also, that research is quite distant from B-tree binsearch: their data items are small and do not represent references to actual data. Eytzinger exploits the fact that two possible future keys share the same cache line, but that is hard to guarantee, given that we place items at fairly random offsets within the page.

Best regards, Andrey Borodin.
Re: Experimenting with hash join prefetch
> On Sun, 14 Oct 2018 at 06:19, Thomas Munro > wrote: > > Cache-oblivious hash joins cause a lot of TLB and cache misses. > ... > (There is another class of cache-aware hash join algorithms that partition > carefully up front to avoid them; that's not us.) Just out of curiosity, can you please elaborate more on this part (with references)? I'm thinking about this topic for a while, and I'm wondering, if by another class you mean something like this [1], then even if it's not us today, are there any issues that prevent from experimenting in this area? [1]: https://www.cse.ust.hk/catalac/papers/coqp_tods08.pdf
Re: Experimenting with hash join prefetch
On Mon, Oct 15, 2018 at 12:16 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Sun, 14 Oct 2018 at 06:19, Thomas Munro wrote:
> > Cache-oblivious hash joins cause a lot of TLB and cache misses.
> > ...
> > (There is another class of cache-aware hash join algorithms that partition
> > carefully up front to avoid them; that's not us.)
>
> Just out of curiosity, can you please elaborate more on this part (with
> references)? I'm thinking about this topic for a while, and I'm wondering, if
> by another class you mean something like this [1], then even if it's not us
> today, are there any issues that prevent from experimenting in this area?

Hmm, I didn't mean the term-of-art "cache-oblivious" (Wikipedia definition: "an algorithm designed to take advantage of a CPU cache without having the size of the cache as an explicit parameter"); I meant not "cache-conscious" (we don't do anything at all to reduce cache misses, though obviously we could add prefetching to improve on that).

The distinction I'm making is between "no partition" hash join (what we have), where you don't have to do any work up front but you pay for a lot of cache misses during building and probing, and "partition" hash join, notably "radix join" (what MonetDB has?), where you have a partition phase before the build phase that aims to break the data up into enough partitions that the hash tables will fit better in cache, making the later phases faster. There seems to be some disagreement about which is best -- passing through the data first is expensive, but so are cache misses on every probe, and there are claims that the winner depends on skew and tuple size.

Here's some of the stuff I read/watched about this subject: https://15721.courses.cs.cmu.edu/spring2018/schedule.html#apr-04-2018 Add to that http://www.adms-conf.org/2017/camera-ready/Analyzing_In_Memory_Hash_Joins__Granularity_Matters.pdf

Skimming very quickly through the paper you posted, yeah, I mean exactly that stuff.
Specifically I was thinking of the radix join mentioned in background section 2.3. (I see also that the same authors wrote a paper "Cache-Oblivious Hash Joins" which appears to describe a refinement of radix that doesn't need to be parameterised for cache size.) Sure, we could always consider these ideas. I am not personally working on that; to me it looked very hard to build, for a feature so uncertain to produce better results! (Note: we do of course have some kind of partitioning called "batching" when work_mem runs out, but it's not a kind of partitioning that cares about reducing cache misses, so if I understood correctly it's still "no partition" as far as this discussion goes.) -- Thomas Munro http://www.enterprisedb.com
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/13/2018 09:59 PM, Michael Paquier wrote:
> On Sat, Oct 13, 2018 at 05:53:00PM -0400, Andrew Dunstan wrote:
>> It occurred to me that a pretty simple fix could just be to blacklist
>> everything that didn't start with a digit. The whitelist approach is
>> probably preferable... depends how urgent we see this as.
>
> Yeah, possibly, still that's not a correct long-term fix. So attached
> is what I think we should do for HEAD and REL_11_STABLE with an
> approach using a whitelist. I have added positive and negative tests
> on top of the existing TAP tests, as suggested by Michael B, and I made
> the code use relpath.h to make sure that we don't miss any fork types.
> Any objections to this patch?

This code now seems redundant:

    if (strcmp(fn, ".") == 0 ||
        strcmp(fn, "..") == 0)
        return true;

I would probably reverse the order of these two tests. It might not make any difference, assuming fn is never an empty string, but it seems more logical to me:

    + /* good to go if only digits */
    + if (fn[pos] == '\0')
    +     return false;
    + /* leave if no digits */
    + if (pos == 0)
    +     return true;

It also looks to me like the check for a segment number doesn't ensure there is at least one digit, so "123." might pass, but I could be wrong. In any case, there isn't a test for that, and there probably should be.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
rowcount estimate varies WRT partitionwise_join
I was cross-eyed yesterday due to merge conflicts, but this still seems odd. I thought that final row counts would not vary with the details of the chosen plan. That seems to hold true when I disable parallel join or hash join, but not for PWJ.

I noticed this behavior while joining our own tables using an eq join on the partition key plus an inequality comparison also on the partition key (range), but I see the same thing using tables from the regression test:

pryzbyj=# EXPLAIN SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0;
                                QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=6.96..13.83 rows=12 width=18)
   Hash Cond: (t2.b = t1.a)
   ->  Append  (cost=0.00..6.00 rows=200 width=9)
         ->  Seq Scan on prt2_p1 t2  (cost=0.00..1.84 rows=84 width=9)
         ->  Seq Scan on prt2_p2 t2_1  (cost=0.00..1.83 rows=83 width=9)
         ->  Seq Scan on prt2_p3 t2_2  (cost=0.00..1.33 rows=33 width=9)
   ->  Hash  (cost=6.81..6.81 rows=12 width=9)
         ->  Append  (cost=0.00..6.81 rows=12 width=9)
               ->  Seq Scan on prt1_p1 t1  (cost=0.00..2.56 rows=5 width=9)
                     Filter: (b = 0)
               ->  Seq Scan on prt1_p2 t1_1  (cost=0.00..2.56 rows=5 width=9)
                     Filter: (b = 0)
               ->  Seq Scan on prt1_p3 t1_2  (cost=0.00..1.62 rows=2 width=9)
                     Filter: (b = 0)

pryzbyj=# SET enable_partitionwise_join=on;
pryzbyj=# EXPLAIN SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0;
                                QUERY PLAN
---------------------------------------------------------------------------
 Append  (cost=2.62..12.75 rows=7 width=18)
   ->  Hash Join  (cost=2.62..4.81 rows=3 width=18)
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on prt2_p1 t2  (cost=0.00..1.84 rows=84 width=9)
         ->  Hash  (cost=2.56..2.56 rows=5 width=9)
               ->  Seq Scan on prt1_p1 t1  (cost=0.00..2.56 rows=5 width=9)
                     Filter: (b = 0)
   ->  Hash Join  (cost=2.62..4.80 rows=3 width=18)
         Hash Cond: (t2_1.b = t1_1.a)
         ->  Seq Scan on prt2_p2 t2_1  (cost=0.00..1.83 rows=83 width=9)
         ->  Hash  (cost=2.56..2.56 rows=5 width=9)
               ->  Seq Scan on prt1_p2 t1_1  (cost=0.00..2.56 rows=5 width=9)
                     Filter: (b = 0)
   ->  Hash Join  (cost=1.65..3.11 rows=1 width=18)
         Hash Cond: (t2_2.b = t1_2.a)
         ->  Seq Scan on prt2_p3 t2_2  (cost=0.00..1.33 rows=33 width=9)
         ->  Hash  (cost=1.62..1.62 rows=2 width=9)
               ->  Seq Scan on prt1_p3 t1_2  (cost=0.00..1.62 rows=2 width=9)
                     Filter: (b = 0)
Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel
Andres Freund writes:
> On 2018-10-12 19:47:40 -0400, Tom Lane wrote:
>> So I went looking for a different example to plug in there, and soon
>> found that there weren't any. If you change all the physically_coercible
>> calls in that script to binary_coercible, its output doesn't change.
>> I'm thinking that we ought to do that, and just get rid of
>> physically_coercible(), so that we have a tighter, more semantically
>> meaningful set of checks here. We can always undo that if we ever
>> have occasion to type-cheat like that again, but offhand I'm not sure
>> why we would do so.
>
> Hm, I wonder if it's not a good idea to leave the test there, or rewrite
> it slightly, so we have a more precise warning about cheats like that?

After thinking about this for awhile, I decided that physically_coercible() is poorly named, because it suggests that it might for instance allow any 4-byte type to be cast to any other one. Actually the additional cases it allows are just ones where an explicit binary-coercion cast would be needed. So I still think we should tighten up the tests while we can, but I left that function in place with a different name and a better comment.

regards, tom lane
Re: rowcount estimate varies WRT partitionwise_join
Justin Pryzby writes: > I was crosseyed yesterday due to merge conflicts, but this still seems odd. > I thought that final row counts would not vary with the details of the chosen > plan. Which seems to hold true when I disable parallel join or hash join, but > not for PWJ. There are good reasons why that's not enabled by default yet ;-) regards, tom lane
Re: WIP: Avoid creation of the free space map for small tables
> On 10/13/18, Amit Kapila wrote:
>> I think you have found a good way to avoid creating FSM, but can't we
>> use some simpler technique like if the FSM fork for a relation doesn't
>> exist, then check the heapblk number for which we try to update the
>> FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
>> creating the FSM.
>> I think it would be better if we can find a common way to avoid
>> creating FSM both during DO and REDO time. It might be possible if
>> something like what I have said above is feasible.

I've attached v4, which implements the REDO case as closely as possible to the DO case. I've created a new function to guard against creation of the FSM, which is called by RecordPageWithFreeSpace() and RecordAndGetPageWithFreeSpace(). Since XLogRecordPageWithFreeSpace() takes a relfilenode and not a relation, I had to reimplement that separately, but the logic is basically the same. It works under streaming replication.

I've also attached a couple of SQL scripts which, when the aforementioned DEBUG1 calls are enabled, show what the heap insert code is doing for different scenarios. Make check-world passes.

-John Naylor

From f3abe9fc003430530db71ddd40c4e9bed8a38513 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Sun, 14 Oct 2018 22:35:12 +0700
Subject: [PATCH v4] Avoid creation of the free space map for small tables.

The FSM isn't created if the heap has fewer than 10 blocks. If the last known good block has insufficient space, try every block before extending the heap. If a heap with a FSM is truncated back to below the threshold, the FSM stays around and can be used as usual.
---
 contrib/pageinspect/expected/page.out     |  77 ++--
 contrib/pageinspect/sql/page.sql          |  33 --
 src/backend/access/heap/hio.c             | 137 +-
 src/backend/storage/freespace/freespace.c |  61 +-
 src/include/storage/freespace.h           |   4 +
 5 files changed, 229 insertions(+), 83 deletions(-)

diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 3fcd9fbe6d..83e5910453 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -1,48 +1,69 @@
 CREATE EXTENSION pageinspect;
-CREATE TABLE test1 (a int, b int);
-INSERT INTO test1 VALUES (16777217, 131584);
-VACUUM test1; -- set up FSM
+CREATE TABLE test_rel_forks (a int);
+-- Make sure there are enough blocks in the heap for the FSM to be created.
+INSERT INTO test_rel_forks SELECT g from generate_series(1,1) g;
+-- set up FSM and VM
+VACUUM test_rel_forks;
 -- The page contents can vary, so just test that it can be read
 -- successfully, but don't keep the output.
-SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0;
  main_0
 --------
    8192
 (1 row)
 
-SELECT octet_length(get_raw_page('test1', 'main', 1)) AS main_1;
-ERROR:  block number 1 is out of range for relation "test1"
-SELECT octet_length(get_raw_page('test1', 'fsm', 0)) AS fsm_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
+ERROR:  block number 100 is out of range for relation "test_rel_forks"
+SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
  fsm_0
 -------
   8192
 (1 row)
 
-SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1;
- fsm_1
--------
-  8192
-(1 row)
-
-SELECT octet_length(get_raw_page('test1', 'vm', 0)) AS vm_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
+ERROR:  block number 10 is out of range for relation "test_rel_forks"
+SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0;
  vm_0
 ------
  8192
 (1 row)
 
-SELECT octet_length(get_raw_page('test1', 'vm', 1)) AS vm_1;
-ERROR:  block number 1 is out of range for relation "test1"
+SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 1)) AS vm_1;
+ERROR:  block number 1 is out of range for relation "test_rel_forks"
 SELECT octet_length(get_raw_page('xxx', 'main', 0));
 ERROR:  relation "xxx" does not exist
-SELECT octet_length(get_raw_page('test1', 'xxx', 0));
+SELECT octet_length(get_raw_page('test_rel_forks', 'xxx', 0));
 ERROR:  invalid fork name
 HINT:  Valid fork names are "main", "fsm", "vm", and "init".
-SELECT get_raw_page('test1', 0) = get_raw_page('test1', 'main', 0);
+SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));
+ fsm_page_contents
+-------------------
+ 0: 192           +
+ 1: 192           +
+ 3: 192           +
+ 7: 192           +
+ 15: 192          +
+ 31: 192          +
+ 63: 192          +
+ 127: 192         +
+ 255: 192         +
+ 511: 192         +
+ 1023: 192        +
+ 2047: 192        +
+ 4095: 192        +
+ fp_next_slot: 0  +
+
+(1 row)
+
+SELECT get_raw_page('test_rel_forks', 0) = get_raw_page('test_rel_forks', 'main', 0);
  ?column?
 ----------
  t
 (1 row)
 
+DROP TABLE test_rel_forks;
+CREATE TABLE test
Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT
On Tue, Oct 2, 2018 at 3:40 AM Michael Paquier wrote:
>
> The last patch set does not apply, so this is moved to next CF, waiting
> on author as new status.
>

Updated the last patch so it can apply cleanly on HEAD. About the bugfixes, do you think it is better to move them to another thread?

Best regards,
--
Matheus de Oliveira

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index f13a6cd944..5910680cf3 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -55,7 +55,9 @@ ALTER TABLE [ IF EXISTS ] name
     ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
     ADD table_constraint [ NOT VALID ]
     ADD table_constraint_using_index
-    ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+    ALTER CONSTRAINT constraint_name
+        [ ON DELETE action ] [ ON UPDATE action ]
+        [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
     VALIDATE CONSTRAINT constraint_name
     DROP CONSTRAINT [ IF EXISTS ] constraint_name [ RESTRICT | CASCADE ]
     DISABLE TRIGGER [ trigger_name | ALL | USER ]
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e112b4ef4..86dabc9bbc 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7843,8 +7843,43 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
 				 errmsg("constraint \"%s\" of relation \"%s\" is not a foreign key constraint",
 						cmdcon->conname, RelationGetRelationName(rel))));
 
+	/*
+	 * Check for FKCONSTR_ACTION_UNKNOWN; if found, replace it by the current
+	 * action. We could handle FKCONSTR_ACTION_UNKNOWN below, but since we
+	 * already have to handle the case of changing to the same action, it
+	 * seems simpler to simply replace FKCONSTR_ACTION_UNKNOWN by the current
+	 * action here.
+	 */
+	if (cmdcon->fk_del_action == FKCONSTR_ACTION_UNKNOWN)
+		cmdcon->fk_del_action = currcon->confdeltype;
+
+	if (cmdcon->fk_upd_action == FKCONSTR_ACTION_UNKNOWN)
+		cmdcon->fk_upd_action = currcon->confupdtype;
+
+	/*
+	 * Do the same for the deferrable attributes. But consider that if only
+	 * the initdeferred attribute was changed, and to true, force deferrable
+	 * to be true as well. On the other hand, if only the deferrable
+	 * attribute was changed, and to false, force initdeferred to be false
+	 * as well.
+	 */
+	if (!cmdcon->was_deferrable_set)
+		cmdcon->deferrable = cmdcon->initdeferred ? true : currcon->condeferrable;
+
+	if (!cmdcon->was_initdeferred_set)
+		cmdcon->initdeferred = !cmdcon->deferrable ? false : currcon->condeferred;
+
+	/*
+	 * This is a safety check only; it should never happen here.
+	 */
+	if (cmdcon->initdeferred && !cmdcon->deferrable)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("constraint declared INITIALLY DEFERRED must be DEFERRABLE")));
+
 	if (currcon->condeferrable != cmdcon->deferrable ||
-		currcon->condeferred != cmdcon->initdeferred)
+		currcon->condeferred != cmdcon->initdeferred ||
+		currcon->confdeltype != cmdcon->fk_del_action ||
+		currcon->confupdtype != cmdcon->fk_upd_action)
 	{
 		HeapTuple	copyTuple;
 		HeapTuple	tgtuple;
@@ -7862,6 +7897,8 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
 		copy_con = (Form_pg_constraint) GETSTRUCT(copyTuple);
 		copy_con->condeferrable = cmdcon->deferrable;
 		copy_con->condeferred = cmdcon->initdeferred;
+		copy_con->confdeltype = cmdcon->fk_del_action;
+		copy_con->confupdtype = cmdcon->fk_upd_action;
 		CatalogTupleUpdate(conrel, &copyTuple->t_self, copyTuple);
 
 		InvokeObjectPostAlterHook(ConstraintRelationId,
@@ -7898,23 +7935,106 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
 			otherrelids = list_append_unique_oid(otherrelids, tgform->tgrelid);
 
-			/*
-			 * Update deferrability of RI_FKey_noaction_del,
-			 * RI_FKey_noaction_upd, RI_FKey_check_ins and RI_FKey_check_upd
-			 * triggers, but not others; see createForeignKeyTriggers and
-			 * CreateFKCheckTrigger.
-			 */
-			if (tgform->tgfoid != F_RI_FKEY_NOACTION_DEL &&
-				tgform->tgfoid != F_RI_FKEY_NOACTION_UPD &&
-				tgform->tgfoid != F_RI_FKEY_CHECK_INS &&
-				tgform->tgfoid != F_RI_FKEY_CHECK_UPD)
-				continue;
-
 			copyTuple = heap_copytuple(tgtuple);
 			copy_tg = (Form_pg_trigger) GETSTRUCT(copyTuple);
+
+			/*
+			 * Set deferrability here, but note that it may be overridden
+			 * below if the pg_trigger entry is on the referencing table and
+			 * depending on the action used for ON UPDATE/DELETE. But for
+			 * check triggers (in the referenced table) it is kept as is
+			 * (since ON UPDATE/DELETE actions make no difference for the
+			 * check triggers).
+			 */
 			copy_tg->tgdeferrable = cmdcon->deferrable;
 			copy_tg->tginitdeferred = cmdcon->initdeferred;
+
+			/*
+			 * Set ON DELETE action
+			 */
+			if (tgform->tgfoid == F_RI_FKEY_NOACTION_DEL ||
+				tgform->tgfoid == F_RI_FKEY_RESTRICT_DEL ||
+				tgfor
Re: B-tree cache prefetches
On Mon, Oct 15, 2018 at 12:03 AM Andrey Borodin wrote:
>> On 31 Aug 2018, at 2:40, Thomas Munro wrote:
>> A related topic is the cache-unfriendliness of traditional binary
>> searches of sorted data. Check out "Array Layouts for
>> Comparison-Based Searching"[1] if you haven't already. It says that
>> if your array fits in L2 cache, as our btree pages do, then binary
>> search still wins against Eytzinger et al, which I was a bit
>> disappointed
>
> I've re-read that paper. Their results are not about our case; here's a quote:
>> For arrays small enough _to be kept_ in L2 cache, the branch-free binary
>> search code listed in Listing 2 is the fastest algorithm
>
> Surely, we cannot keep all pages in L2. The Eytzinger layout could be a
> realistic possibility, except for one simple problem: I do not see how we
> can insert items into this layout.

I don't know. Aren't we talking about binary search within one 8KB page here, anyway?

Thinking about some other ideas for cache prefetching in btree code, here's an idea to improve pipelining of index scans (a bit like the two-step pipeline idea from the hash join prefetch thread). We know the TIDs of heap tuples we'll be looking up in the near future, so can we prefetch all the way to the tuple? There are three random accesses that follow soon after reading an index tuple: BufTableLookup(), then the page header, then the tuple itself. At the logical extreme, we could anticipate the probe of SharedBufHash for the TID from index tuple i + 3 (perhaps dynahash could expose a hash_prefetch_for_hash() facility), and anticipate the read of the page header for the tuple pointed to by index tuple i + 2 (perhaps a dirty read of SharedBufHash that is only prepared to look at the first entry in the bucket would be OK for this), and anticipate the read of the tuple for the TID from index tuple i + 1 (perhaps a dirty read of the page item table).
Or some less ambitious subset, if that pipeline turns out to be too deep; it seems like the simplest experiment would be to try prefetching just the SharedBufHash lookups. This may all be completely crazy, I haven't tried any of it.

> Also, that research is quite distant from B-tree binsearch: their data items
> are small and do not represent references to actual data. Eytzinger exploits
> the fact that two possible future keys share the same cache line, but that is
> hard to guarantee, given that we place items at fairly random offsets within
> the page.

Based on the above quotes, I'm not suggesting we should use Eytzinger for search-within-page. But out of curiosity, I checked, and saw that you can actually get a pair of index tuples holding a six-character text key or an integer key into a 64-byte cache line.

-- Thomas Munro http://www.enterprisedb.com
Re: "SELECT ... FROM DUAL" is not quite as silly as it appears
Robert Haas writes:
> On Thu, Mar 15, 2018 at 11:27 AM, Tom Lane wrote:
>> We've long made fun of Oracle(TM) for the fact that if you just want
>> to evaluate some expressions, you have to write "select ... from dual"
>> rather than just "select ...". But I've realized recently that there's
>> a bit of method in that madness after all. Specifically, having to
>> cope with FromExprs that contain no base relation is fairly problematic
>> in the planner. prepjointree.c is basically unable to cope with
>> flattening a subquery that looks that way, although we've inserted a
>> lot of overly-baroque logic to handle some subsets of the case (cf
>> is_simple_subquery(), around line 1500). If memory serves, there are
>> other places that are complicated by the case.
>>
>> Suppose that, either in the rewriter or early in the planner, we were
>> to replace such cases with nonempty FromExprs, by adding a dummy RTE
>> representing a table with no columns and one row. This would in turn
>> give rise to an ordinary Path that converts to a Result plan, so that
>> the case is handled without any special contortions later. Then there
>> is no case where we don't have a nonempty relids set identifying a
>> subquery, so that all that special-case hackery in prepjointree.c
>> goes away, and we can simplify whatever else is having a hard time
>> with it.
>
> Good idea.

Here's a proposed patch along those lines. Some notes for review:

* The new RTE kind is "RTE_RESULT" after the kind of Plan node it will produce. I'm not entirely in love with that name, but couldn't think of a better idea.

* I renamed the existing ResultPath node type to GroupResultPath to clarify that it's not used to scan RTE_RESULT RTEs, but just for degenerate grouping cases. RTE_RESULT RTEs (usually) give rise to plain Path nodes with pathtype T_Result.
It doesn't work very well to try to unify those two cases, even though they give rise to identical Plans, because there are different rules about where the associated quals live during query_planner.

* The interesting changes are in prepjointree.c; almost all the rest of the patch is boilerplate to support RTE_RESULT RTEs in mostly the same way that other special RTE-scan plan types are handled in the planner. In prepjointree.c, I ended up getting rid of the original decision to try to delete removable RTEs during pull_up_subqueries, and instead made it happen in a separate traversal of the join tree. That's a lot less complicated, and it has better results because we can optimize more cases once we've seen the results of expression preprocessing and outer-join strength reduction.

* I tried to get rid of the empty-jointree special case in query_planner altogether. While the patch works fine that way, it makes for a measurable slowdown in planning trivial queries --- I saw close to 10% degradation in TPS rate for a pgbench test case that was just

$ cat trivialselect.sql
select 2+2;
$ pgbench -n -T 10 -f trivialselect.sql

So I ended up putting back the special case, but it's much less of a cheat than it was before; the RelOptInfo it hands back is basically the same as the normal path would produce. For me, the patch as given is within the noise level of being the same speed as HEAD on this case.

* There's a hack in nodeResult.c to prevent the executor from crashing on a whole-row Var for an RTE_RESULT RTE, which is something that the planner will create in SELECT FOR UPDATE cases, because it thinks it needs to provide a ROW_MARK_COPY image of the RTE's output. We might be able to get rid of that if we could teach the planner that it need not bother rowmarking RESULT RTEs, but that seemed like it would be really messy. (At the point where the decision is made, we don't know whether a subquery might end up as just a RESULT, or indeed vanish entirely.)
Since I couldn't measure any reproducible penalty from having the extra setup cost for a Result plan, I left it like this.

* There are several existing test cases in join.sql whose plans change for the better with this patch, so I didn't really think we needed any additional test cases to show that it's working.

* There are a couple of cosmetic changes in EXPLAIN output as a result of ruleutils.c seeing more RTEs in the plan's rtable than it did before, causing it to decide to table-qualify Var names in more cases. We could maybe spend more effort on ruleutils.c's heuristic for when to qualify, but that seems like a separable problem, and anyway it's only cosmetic.

I'll add this to the next CF.

regards, tom lane

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79..efa5596 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2404,6 +2404,8 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 			case RTE_NAMEDTUPLESTORE:
 				APP_JUMB_STRING(rte
Re: [RFC] Removing "magic" oids
Hi,

On 2018-09-29 20:48:10 -0700, Andres Freund wrote:
> In my opinion the current WITH OIDs system has numerous weaknesses:
>
> 1) The fact that oids are so magic means that if we get pluggable
>    storage, the design of the potential pluggable systems is constrained
>    and similar magic has to be present everywhere.
>
> 2) The fact that the oids in each table have the same counter to be
>    based on means that oid wraparounds have much worse consequences
>    performance wise than necessary. E.g. once the global counter has
>    wrapped, all toast tables start to be significantly slower.
>
>    It would be much better if most database objects had their own
>    counters.
>
> 3) For some oid using objects (toast, large objects at the very least)
>    it'd be quite worthwhile to switch to 8 byte ids. Currently that's
>    hard to do, because it'd break on-disk compatibility.
>
> 4) There's a lot of special case code around for dealing with oids.
>
> 5a) The fact that system table oids don't show up in selects by default
>     makes it more work than necessary to look at catalogs.
>
> 5b) Similarly, it's fairly annoying when debugging not to trivially see
>     oids for catalog structs.
>
> I think we should drop WITH OIDs support. pg_dump should convert WITH
> OIDs tables into tables that have an explicit oid column (with an
> appropriate default function), pg_upgrade should refuse to upgrade them.
> We've defaulted WITH OIDs to off for quite a while now, so that's
> hopefully not going to be too painful.
>
> For catalog tables, I think we should just add explicit oid columns.
> That obviously requires a fair amount of change, but it's not too bad.
> One issue here is that we want to support "manual" inserts both for
> initdb, and for emergency work.
>
> The attached *PROTOTYPE* *WIP* patch removes WITH OIDs support, converts
> catalog table oids to explicit oid columns, and makes enough
> infrastructure changes to make plain pg_regress pass (after the
> necessary changes of course). Only superficial psql and no pg_dump
> changes.
>
> There's plenty of issues with the prototype, but overall I'm pretty
> pleased.
>
> There's three major areas I'm not so sure about:
>
> 1) There's a few places dealing with system tables that don't deal with
>    a hardcoded system table. Since there's no notion of "table has oid"
>    and "which column is the oid column" anymore, we need some way to
>    deal with that. So far I've just extended existing objectaddress.c
>    code to deal with that, but it's not that pretty.
>
> 2) We need to be able to manually insert into catalog tables. Both
>    initdb and emergency surgery. My current hack is doing so directly
>    in nodeModifyTable.c but that's beyond ugly. I think we should add
>    an explicit DEFAULT clause to those columns with something like
>    nextoid('tablename', 'name_of_index') or such.
>
> 3) The quickest way to deal with the bootstrap code was to just assign
>    all oids for oid carrying tables that don't yet have any assigned.
>    That doesn't generally seem terrible, although it's certainly badly
>    implemented right now. That'd mean we'd have three ranges of oids
>    probably, unless we somehow have the bootstrap code advance the
>    in-database oid counter to a right state before we start with
>    post-bootstrap work. I like the idea of all bootstrap time oids
>    being determined by genbki.pl (I think that'll allow to remove some
>    code too).
>
> I do wonder about ripping the oid counter out entirely, and replacing it
> with sequences. But that seems like a separate project.
>
> I'll polish this up further in the not too far away future. But I'd
> really like to get some feedback before I sink more time into this.

Does anybody have engineering / architecture level comments about this proposal?

Greetings,

Andres Freund
Re: "SELECT ... FROM DUAL" is not quite as silly as it appears
On Fri, Mar 16, 2018 at 4:27 AM Tom Lane wrote: > We've long made fun of Oracle(TM) for the fact that if you just want > to evaluate some expressions, you have to write "select ... from dual" > rather than just "select ...". But I've realized recently that there's > a bit of method in that madness after all. We can still make fun of that table. Apparently it had two rows, so you could double rows by cross joining against it, but at some point one of them went missing, leaving a strange name behind. Source: https://en.wikipedia.org/wiki/DUAL_table#History -- Thomas Munro http://www.enterprisedb.com
Re: [RFC] Removing "magic" oids
Andres Freund writes:
> Does anybody have engineering / architecture level comments about this
> proposal?

FWIW, I'm -1 on making OIDs be not-magic for SELECT purposes.  Yeah, it's
a wart we wouldn't have if we designed the system today, but the wart is
thirty years old.  I think changing that will break so many catalog
queries that we'll have the villagers on the doorstep.  Most of the other
things you're suggesting here could be done easily without making that
change.

Possibly we could make them not-magic from the storage standpoint (ie
they're regular columns) but have a pg_attribute flag that says not
to include them in "SELECT *" expansion.

BTW, the fact that we have magic OIDs in the system catalogs doesn't
mean that any other storage system has to support that.  (When I was
at Salesforce, we endured *substantial* pain from insisting that we
move the catalogs into their custom storage system.  There were good
reasons to do so, but it's not a decision I'd make again if there were
any plausible way not to.)

Personally, I'd drop WITH OIDs support for user tables altogether, rather
than having pg_dump create a "compatible" translation that won't be very
compatible if it loses the magicness aspect.

			regards, tom lane
Re: pgsql: Add TAP tests for pg_verify_checksums
On Sun, Oct 14, 2018 at 10:24:55AM -0400, Andrew Dunstan wrote:
> [snip]

Thanks a lot for the review, Andrew!

> This code now seems redundant:
>
>     if (strcmp(fn, ".") == 0 ||
>         strcmp(fn, "..") == 0)
>         return true;

Indeed.  I have renamed skipfile() to isRelFileName on the way, which
looks better in the context.

> I would probably reverse the order of these two tests. It might not make
> any difference, assuming fn is never an empty string, but it seems more
> logical to me.
>
> +   /* good to go if only digits */
> +   if (fn[pos] == '\0')
> +       return false;
> +   /* leave if no digits */
> +   if (pos == 0)
> +       return true;

No objections to that.  Done.

> It also looks to me like the check for a segment number doesn't ensure
> there is at least one digit, so "123." might pass, but I could be
> wrong. In any case, there isn't a test for that, and there probably
> should be.

You are right here.  This name pattern has been failing.  Fixed in the
attached.  I have added "123_" and "123_." as extra patterns to check.

What do you think about the updated version attached?
--
Michael

diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 1bc020ab6c..666ab3f21e 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -15,6 +15,7 @@
 
 #include "catalog/pg_control.h"
 #include "common/controldata_utils.h"
+#include "common/relpath.h"
 #include "getopt_long.h"
 #include "pg_getopt.h"
 #include "storage/bufpage.h"
@@ -49,27 +50,69 @@ usage(void)
 	printf(_("Report bugs to .\n"));
 }
 
-static const char *const skip[] = {
-	"pg_control",
-	"pg_filenode.map",
-	"pg_internal.init",
-	"PG_VERSION",
-	NULL,
-};
-
+/*
+ * isRelFileName
+ *
+ * Check if the given file name has a shape authorized for scanning.
+ */
 static bool
-skipfile(const char *fn)
+isRelFileName(const char *fn)
 {
-	const char *const *f;
+	int			pos;
 
-	if (strcmp(fn, ".") == 0 ||
-		strcmp(fn, "..") == 0)
+	/*----
+	 * Only files including data checksums are authorized for scanning.  This
+	 * is guessed based on the file name, by reverse-engineering
+	 * GetRelationPath so make sure to update both code paths if any updates
+	 * are done.  The following file names are allowed:
+	 * <digits>
+	 * <digits>.<segment>
+	 * <digits>_<forkname>
+	 * <digits>_<forkname>.<segment>
+	 *
+	 * Note that temporary files, beginning with 't', are also skipped.
+	 *
+	 *----
+	 */
+
+	/* A non-empty string of digits should follow */
+	for (pos = 0; isdigit((unsigned char) fn[pos]); ++pos)
+		;
+	/* leave if no digits */
+	if (pos == 0)
+		return false;
+	/* good to go if only digits */
+	if (fn[pos] == '\0')
 		return true;
 
-	for (f = skip; *f; f++)
-		if (strcmp(*f, fn) == 0)
-			return true;
-	return false;
+	/* Authorized fork files can be scanned */
+	if (fn[pos] == '_')
+	{
+		int			forkchar = forkname_chars(&fn[pos + 1], NULL);
+
+		if (forkchar <= 0)
+			return false;
+
+		pos += forkchar + 1;
+	}
+
+	/* Check for an optional segment number */
+	if (fn[pos] == '.')
+	{
+		int			segchar;
+
+		for (segchar = 1; isdigit((unsigned char) fn[pos + segchar]); ++segchar)
+			;
+
+		if (segchar <= 1)
+			return false;
+		pos += segchar;
+	}
+
+	/* Now this should be the end */
+	if (fn[pos] != '\0')
+		return false;
+	return true;
 }
 
 static void
@@ -146,7 +189,7 @@ scan_directory(const char *basedir, const char *subdir)
 		char		fn[MAXPGPATH];
 		struct stat st;
 
-		if (skipfile(de->d_name))
+		if (!isRelFileName(de->d_name))
 			continue;
 
 		snprintf(fn, sizeof(fn), "%s/%s", path, de->d_name);
diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl
index dc29da09af..8442534e95 100644
--- a/src/bin/pg_verify_checksums/t/002_actions.pl
+++ b/src/bin/pg_verify_checksums/t/002_actions.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 12;
+use Test::More tests => 36;
 
 # Initialize node with checksums enabled.
 my $node = get_new_node('node_checksum');
@@ -17,6 +17,31 @@ command_like(['pg_controldata', $pgdata],
 	qr/Data page checksum version:.*1/,
 	'checksums enabled in control file');
 
+# Add set of dummy files with some contents.  These should not be scanned
+# by the tool.
+append_to_file "$pgdata/global/123.", "foo";
+append_to_file "$pgdata/global/123_", "foo";
+append_to_file "$pgdata/global/123_.", "foo";
+append_to_file "$pgdata/global/123.12t", "foo";
+append_to_file "$pgdata/global/foo", "foo2";
+append_to_file "$pgdata/global/t123", "bar";
+append_to_file "$pgdata/global/123a", "bar2";
+append_to_file "$pgdata/global/.123", "foobar";
+append_to_file "$pgdata/global/_fsm", "foobar2";
+append_to_file "$pgdata/global/_init", "foobar3";
+append_to_file "$pgdata/global/_vm.123", "foohoge";
+append_to_file "$pgdata/global/123_vm.123t", "foohoge2";
+
+# Those are correct but empty files, so they
Re: [RFC] Removing "magic" oids
Hi,

On 2018-10-14 18:34:50 -0400, Tom Lane wrote:
> Andres Freund writes:
>> Does anybody have engineering / architecture level comments about this
>> proposal?
>
> FWIW, I'm -1 on making OIDs be not-magic for SELECT purposes. Yeah, it's
> a wart we wouldn't have if we designed the system today, but the wart is
> thirty years old. I think changing that will break so many catalog
> queries that we'll have the villagers on the doorstep. Most of the other
> things you're suggesting here could be done easily without making that
> change.

Yea, I agree that it'll cause some pain.  And we could easily 'de-magic'
oids everywhere except the SELECT * processing (that'd probably leave
some behavioural difference with composite types, but there it'd
probably purely be better).  I'm not sure we want that however - yes,
the short-term pain will be lower, but do we really want to inflict the
confusion about invisible oids on our users for the next 20 years?  I
think there's a fair argument to be made that we should cause pain once,
rather than continuing to inflict lower doses of pain.

> Possibly we could make them not-magic from the storage standpoint (ie
> they're regular columns) but have a pg_attribute flag that says not
> to include them in "SELECT *" expansion.

Right.  And we could just use that for system columns too, removing a
number of special-case code paths.  Seems not unlikely that we'll have
further magic columns that want to be hidden by default, given Kevin is
pondering re-starting his work on incremental matviews.

> BTW, the fact that we have magic OIDs in the system catalogs doesn't
> mean that any other storage system has to support that. (When I was
> at Salesforce, we endured *substantial* pain from insisting that we
> move the catalogs into their custom storage system. There were good
> reasons to do so, but it's not a decision I'd make again if there were
> any plausible way not to.)

Right.  I suspect that we'll at some point want to support having
catalog tables in different storage formats, but certainly not
initially.  Part of the reason why I want to remove WITH OIDs support is
precisely that I do not want to add this kind of magic to other storage
formats.

> Personally, I'd drop WITH OIDs support for user tables altogether,
> rather than having pg_dump create a "compatible" translation that won't
> be very compatible if it loses the magicness aspect.

Yea, I'm on board with that.

Greetings,

Andres Freund
Re: [RFC] Removing "magic" oids
On Mon, Oct 15, 2018 at 11:35 AM Tom Lane wrote:
> Andres Freund writes:
>> Does anybody have engineering / architecture level comments about this
>> proposal?
>
> FWIW, I'm -1 on making OIDs be not-magic for SELECT purposes. Yeah, it's
> a wart we wouldn't have if we designed the system today, but the wart is
> thirty years old. I think changing that will break so many catalog
> queries that we'll have the villagers on the doorstep. Most of the other
> things you're suggesting here could be done easily without making that
> change.
>
> Possibly we could make them not-magic from the storage standpoint (ie
> they're regular columns) but have a pg_attribute flag that says not
> to include them in "SELECT *" expansion.

FWIW there is interest in a general facility for hiding arbitrary
attributes from SELECT * for other reasons too:

https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
Re: pgsql: Add TAP tests for pg_verify_checksums
On 10/14/2018 06:41 PM, Michael Paquier wrote:
> On Sun, Oct 14, 2018 at 10:24:55AM -0400, Andrew Dunstan wrote:
>> [snip]
>
> Thanks a lot for the review, Andrew!
>
>> This code now seems redundant:
>>
>>     if (strcmp(fn, ".") == 0 ||
>>         strcmp(fn, "..") == 0)
>>         return true;
>
> Indeed.  I have renamed skipfile() to isRelFileName on the way, which
> looks better in the context.
>
>> I would probably reverse the order of these two tests. It might not
>> make any difference, assuming fn is never an empty string, but it
>> seems more logical to me.
>>
>> +   /* good to go if only digits */
>> +   if (fn[pos] == '\0')
>> +       return false;
>> +   /* leave if no digits */
>> +   if (pos == 0)
>> +       return true;
>
> No objections to that.  Done.
>
>> It also looks to me like the check for a segment number doesn't ensure
>> there is at least one digit, so "123." might pass, but I could be
>> wrong. In any case, there isn't a test for that, and there probably
>> should be.
>
> You are right here.  This name pattern has been failing.  Fixed in the
> attached.  I have added "123_" and "123_." as extra patterns to check.
>
> What do you think about the updated version attached?

Looks good to me.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
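For readers following the review, the isRelFileName() name filter discussed in this thread can be condensed into a short Python sketch. This is a hypothetical rendering, not part of the patch; the fork names fsm, vm and init are taken from relpath.c (forkname_chars() never matches the main fork, which has no suffix):

```python
import re

# Hypothetical Python rendering of the isRelFileName() check:
# <digits>, optionally followed by _<forkname> (fsm, vm or init),
# optionally followed by .<digits> for the segment number.
# Temporary files start with 't' and so fail the leading-digits rule.
REL_FILE_RE = re.compile(r'\A\d+(_(fsm|vm|init))?(\.\d+)?\Z')

def is_rel_file_name(fn):
    """Return True if fn looks like a checksummed relation file name."""
    return REL_FILE_RE.match(fn) is not None
```

Note how the `\.\d+` part requires at least one digit after the dot, which is exactly the "123." hole Andrew spotted and the extra "123_" / "123_." TAP patterns exercise.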
Re: pgbench - add pseudo-random permutation function
Hi,

I reviewed pgbench-prp-func-6.patch.

(1) modular_multiply()

In modular_multiply(), the numbers of digits of the inputs are checked
in order to determine whether to use the interleaved modular
multiplication algorithm.  However, the check condition depends entirely
on the implementation of pseudorandom_perm(), even though
modular_multiply() is a general function.  I think it is better for
pseudorandom_perm() to check the condition directly, because then it can
work efficiently and modular_multiply() can be used for general
purposes.

(2) pseudorandom_perm() and 001_pgbench_with_server.pl

Reading the discussion of the __builtin_xxx functions, I remembered
another overflow issue.

pseudorandom_perm() uses Donald Knuth's linear congruential generator
method to obtain pseudo-random numbers.  This method - like all linear
congruential generator methods - always overflows.  If we did not need
to guarantee the portability of this code, we would not care about the
result of this method, because we just use *pseudo*-random numbers.  (In
fact, I ignored it before.)  However, since we have to guarantee
portability, we have to calculate it accurately.  I therefore
implemented the function dk_lcg() and rewrote pseudorandom_perm().

Just to be sure, I wrote a Python script to check the algorithm of
pseudorandom_perm() and ran it.  Fortunately, my Python code and the
rewritten pseudorandom_perm() returned the same answers, so I updated
the expected answers in 001_pgbench_with_server.pl.

I attached the new patch `pgbench-prp-func-7.patch`, the Python script
`pseudorandom_perm.py`, and the pr_perm check script file
`pr_perm_check.sql`.
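To illustrate the portability point being made here: C's unsigned 64-bit arithmetic wraps around silently, whereas Python integers are arbitrary precision, so a cross-check script must reduce modulo 2^64 explicitly or its values diverge from the C implementation. A minimal sketch (constants copied from the attached script; the helper name `dk_lcg64` and the explicit mask are mine):

```python
DK_LCG_MUL = 6364136223846793005
DK_LCG_INC = 1442695040888963407
MASK64 = (1 << 64) - 1  # emulate C's uint64 wraparound

def dk_lcg64(key):
    # One Knuth-style LCG step, truncated to 64 bits as C overflow would do.
    return (key * DK_LCG_MUL + DK_LCG_INC) & MASK64
```

Without the `& MASK64`, the Python value grows unboundedly and stops matching the C result as soon as the multiplication exceeds 2^64.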
Best regards,

On 2018/10/01 12:19, Fabien COELHO wrote:
>> On Wed, Sep 26, 2018 at 01:20:49PM +0200, Fabien COELHO wrote:
>>> So, am I right to deducing that you are satisfied with the current
>>> status of the patch, with the nbits implementation either based on
>>> popcount (v4) or clz (v5) compiler intrinsics?
>>
>> I think that the clz option is better.
>>
>> Fabien, please note that v5 does not apply,
>
> Indeed, tests interact with 92a0342 committed on Thursday.  Here is a
> rebase of the latest version based on CLZ.  According to basic tests I
> made, the underlying hardware instruction seems to be more often
> available.
>
>> so I moved this patch to next CF, waiting on author.
>
> I'm going to change its state to "Needs review".

\set a1 pr_perm(0, 1000, 54321)
\set a2 pr_perm(1999, 1000, 54321)
\set a3 pr_perm(2500, 1000, 54321)
BEGIN;
DROP TABLE IF EXISTS test;
CREATE TABLE test (id int, data bigint);
INSERT INTO test VALUES(1, :a1);
INSERT INTO test VALUES(2, :a2);
INSERT INTO test VALUES(3, :a3);
END;

#!/usr/bin/python

PRP_ROUNDS = 4
DK_LCG_MUL = 6364136223846793005
DK_LCG_INC = 1442695040888963407
LCG_SHIFT = 13
PRP_PRIMES = 16

primes = [ 8388617, 8912921, 9437189, 9961487, 10485767, 11010059,
           11534351, 12058679, 12582917, 13107229, 13631489, 14155777,
           14680067, 15204391, 15728681, 16252967 ]

def compute_mask(n):
    n = n|(n>>1)
    n = n|(n>>2)
    n = n|(n>>4)
    n = n|(n>>8)
    n = n|(n>>16)
    n = n|(n>>32)
    return n

def dk_lcg(key):
    ret = (key * DK_LCG_MUL + DK_LCG_INC) % 0x
    return ret

def psuderandom_perm(data, size, seed):
    v = data % size
    key = seed
    mask = (compute_mask(size - 1)>>1)
    if (size == 1):
        return 0
    p = key % PRP_PRIMES
    for i in range(0, PRP_ROUNDS):
        key = dk_lcg(key)
        if (v <= mask):
            v ^= ((key >> LCG_SHIFT) & mask)
        key = dk_lcg(key)
        t = size - 1 - v
        if (t <= mask):
            t ^= (key >> LCG_SHIFT) & mask
            v = size - 1 - t
        while (size % primes[p] == 0):
            p = (p + 1) % PRP_PRIMES
        key = dk_lcg(key)
        v = ((primes[p] * v) + (key >> LCG_SHIFT)) % size
        p = (p + 1) % PRP_PRIMES
    return v

def print_ret(v,size,seed,ret):
    s = 'pr_perm(' + str(v) + ', ' + str(size) + ', ' + str(seed) + ') = ' + str(ret)
    print s

## main
for i in range(0, 11):
    r = psuderandom_perm(i, 11, 5432)
    print_ret(i,11,5432,r)

q = [0, 1999, 2500]
for i in q:
    r = psuderandom_perm(i, 1000, 54321)
    print_ret(i, 1000, 54321, r)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index af2dea1..5b09f73 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -327,6 +327,26 @@ fi])# PGAC_C_BUILTIN_BSWAP64
 
+# PGAC_C_BUILTIN_CLZLL
+# -
+# Check if the C compiler understands __builtin_clzll(),
+# and define HAVE__BUILTIN_CLZLL if so.
+# Both GCC & CLANG seem to have one.
+AC_DEFUN([PGAC_C_BUILTIN_CLZLL],
+[AC_CACHE_CHECK(for __builtin_clzll, pgac_cv__builtin_clzll,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static unsigned long
Re: "SELECT ... FROM DUAL" is not quite as silly as it appears
I wrote:
> * There's a hack in nodeResult.c to prevent the executor from crashing
> on a whole-row Var for an RTE_RESULT RTE, which is something that the
> planner will create in SELECT FOR UPDATE cases, because it thinks it
> needs to provide a ROW_MARK_COPY image of the RTE's output.  We might
> be able to get rid of that if we could teach the planner that it need
> not bother rowmarking RESULT RTEs, but that seemed like it would be
> really messy.  (At the point where the decision is made, we don't know
> whether a subquery might end up as just a RESULT, or indeed vanish
> entirely.)  Since I couldn't measure any reproducible penalty from
> having the extra setup cost for a Result plan, I left it like this.

Well, I'd barely sent this when I realized that there was a better way.
The nodeResult.c hack predates my decision to postpone cleaning up
RTE_RESULT RTEs till near the end of the preprocessing phase, and given
that code, it is easy to get rid of rowmarking RESULT RTEs ... in fact,
the code was doing it already, except in the edge case of a SELECT with
only a RESULT RTE.  So here's a version that does not touch
nodeResult.c.
			regards, tom lane

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79..efa5596 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2404,6 +2404,8 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 			case RTE_NAMEDTUPLESTORE:
 				APP_JUMB_STRING(rte->enrname);
 				break;
+			case RTE_RESULT:
+				break;
 			default:
 				elog(ERROR, "unrecognized RTE kind: %d", (int) rte->rtekind);
 				break;
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 21a2ef5..ac3722a 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5374,7 +5374,7 @@ INSERT INTO ft2 (c1,c2,c3) VALUES (1200,999,'foo') RETURNING tableoid::regclass;
  QUERY PLAN
 -
  Insert on public.ft2
-   Output: (tableoid)::regclass
+   Output: (ft2.tableoid)::regclass
    Remote SQL: INSERT INTO "S 1"."T 1"("C 1", c2, c3, c4, c5, c6, c7, c8) VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
    ->  Result
          Output: 1200, 999, NULL::integer, 'foo'::text, NULL::timestamp with time zone, NULL::timestamp without time zone, NULL::character varying, 'ft2       '::character(10), NULL::user_enum
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421..8f7d4ba 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -437,9 +437,12 @@ ExecSupportsMarkRestore(Path *pathnode)
 			return ExecSupportsMarkRestore(((ProjectionPath *) pathnode)->subpath);
 		else if (IsA(pathnode, MinMaxAggPath))
 			return false;		/* childless Result */
+		else if (IsA(pathnode, GroupResultPath))
+			return false;		/* childless Result */
 		else
 		{
-			Assert(IsA(pathnode, ResultPath));
+			/* Simple RTE_RESULT base relation */
+			Assert(IsA(pathnode, Path));
 			return false;		/* childless Result */
 		}
 
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index a10014f..ecaeeb3 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -2319,10 +2319,6 @@ range_table_walker(List *rtable,
 				if (walker(rte->tablesample, context))
 					return true;
 				break;
-			case RTE_CTE:
-			case RTE_NAMEDTUPLESTORE:
-				/* nothing to do */
-				break;
 			case RTE_SUBQUERY:
 				if (!(flags & QTW_IGNORE_RT_SUBQUERIES))
 					if (walker(rte->subquery, context))
@@ -2345,6 +2341,11 @@ range_table_walker(List *rtable,
 				if (walker(rte->values_lists, context))
 					return true;
 				break;
+			case RTE_CTE:
+			case RTE_NAMEDTUPLESTORE:
+			case RTE_RESULT:
+				/* nothing to do */
+				break;
 		}
 
 		if (walker(rte->securityQuals, context))
@@ -3150,10 +3151,6 @@ range_table_mutator(List *rtable,
 					   TableSampleClause *);
 				/* we don't bother to copy eref, aliases, etc; OK? */
 				break;
-			case RTE_CTE:
-			case RTE_NAMEDTUPLESTORE:
-				/* nothing to do */
-				break;
 			case RTE_SUBQUERY:
 				if (!(flags & QTW_IGNORE_RT_SUBQUERIES))
 				{
@@ -3184,6 +3181,11 @@ range_table_mutator(List *rtable,
 			case RTE_VALUES:
 				MUTATE(newrte->values_lists, rte->values_lists, List *);
 				break;
+			case RTE_CTE:
+			case RTE_NAMEDTUPLESTORE:
+			case RTE_RESULT:
+				/* nothing to do */
+				break;
 		}
 		MUTATE(newrte->securityQuals, rte->securityQuals, List *);
 		newrt
Re: WIP: Avoid creation of the free space map for small tables
On Sun, Oct 14, 2018 at 1:09 AM John Naylor wrote:
> On 10/13/18, Amit Kapila wrote:
>> I think you have found a good way to avoid creating FSM, but can't we
>> use some simpler technique like if the FSM fork for a relation doesn't
>> exist, then check the heapblk number for which we try to update the
>> FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
>> creating the FSM.
>
> I think I see what you mean, but to avoid the vacuum problem you just
> mentioned, we'd need to check the relation size, too.

Sure, but vacuum already has the relation size.  In general, I think we
should try to avoid adding more system calls in the common code path, as
that can impact performance.

Few comments on your latest patch:

+static bool
+allow_write_to_fsm(Relation rel, BlockNumber heapBlk)
+{
+	BlockNumber heap_nblocks;
+
+	if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD ||
+		rel->rd_rel->relkind != RELKIND_RELATION)
+		return true;
+
+	/* XXX is this value cached? */
+	heap_nblocks = RelationGetNumberOfBlocks(rel);
+
+	if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD)
+		return true;
+	else
+	{
+		RelationOpenSmgr(rel);
+		return smgrexists(rel->rd_smgr, FSM_FORKNUM);
+	}
+}

I think you can avoid calling RelationGetNumberOfBlocks if you call
smgrexists before, and for the purpose of vacuum, we can get that as an
input parameter.  I think one can argue for not changing the interface
functions like RecordPageWithFreeSpace to avoid calling
RelationGetNumberOfBlocks, but to me, it appears worth it to save the
additional system call.

-	targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
-
-	/*
-	 * If the FSM knows nothing of the rel, try the last page before we
-	 * give up and extend.  This avoids one-tuple-per-page syndrome during
-	 * bootstrapping or in a recently-started system.
-	 */
 	if (targetBlock == InvalidBlockNumber)
-	{
-		BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
-
-		if (nblocks > 0)
-			targetBlock = nblocks - 1;
-	}
+		targetBlock = get_page_no_fsm(relation, InvalidBlockNumber,
+									  &try_every_page);

Is it possible to hide the magic of trying each block within
GetPageWithFreeSpace?  It will simplify the code, and in the future, if
another storage API has a different function for
RelationGetBufferForTuple, it will work seamlessly, provided they are
using the same FSM.  One such user is zheap.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
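The decision logic under review can be summarized in a few lines of Python. This is a hypothetical rendering of the allow_write_to_fsm() sketch quoted above, not the patch itself; the threshold value of 4 blocks is an assumption for illustration, and the relation-level inputs are passed as plain parameters:

```python
HEAP_FSM_EXTENSION_THRESHOLD = 4  # assumed value, for illustration only

def allow_write_to_fsm(heap_blk, heap_nblocks, fsm_exists,
                       is_plain_relation=True):
    # Blocks beyond the threshold (or non-plain relations) always use the FSM.
    if heap_blk > HEAP_FSM_EXTENSION_THRESHOLD or not is_plain_relation:
        return True
    # A small block number, but the heap itself has grown past the threshold.
    if heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD:
        return True
    # Small heap: only write if an FSM fork already exists on disk.
    return fsm_exists
```

This also makes the review comment concrete: if `fsm_exists` is checked first, the `heap_nblocks` lookup (a system call in the real code) can often be skipped for small relations that already have an FSM.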
Re: B-tree cache prefetches
> 15 Oct 2018, at 2:38, Thomas Munro wrote:
>
> On Mon, Oct 15, 2018 at 12:03 AM Andrey Borodin wrote:
>>> 31 Aug 2018, at 2:40, Thomas Munro wrote:
>>> A related topic is the cache-unfriendliness of traditional binary
>>> searches of sorted data. Check out "Array Layouts for
>>> Comparison-Based Searching"[1] if you haven't already. It says that
>>> if your array fits in L2 cache, as our btree pages do, then binary
>>> search still wins against Eytzinger et al, which I was a bit
>>> disappointed
>>
>> I've re-read that paper. Their results are not about our case, here's
>> a quote:
>>> For arrays small enough _to be kept_ in L2 cache, the branch-free
>>> binary search code listed in Listing 2 is the fastest algorithm
>>
>> Surely, we cannot keep all pages in L2. The Eytzinger layout could be
>> a realistic possibility, except for one simple problem: I do not see
>> how we can insert items into this layout.
>
> I don't know. Aren't we talking about binary search within one 8KB
> page here, anyway?

The paper discusses multiple searches inside one 8KB region, while we
are doing a single search in an entirely uncached page and then moving
on.  That's the difference.  Binary search is optimal if the page is in
L1 before we start the search.

> Thinking about some other ideas for cache prefetching in btree code,
> here's an idea to improve pipelining of index scans (a bit like the
> two-step pipeline idea from the hash join prefetch thread). We know
> the TIDs of heap tuples we'll be looking up in the near future, so can
> we prefetch all the way to the tuple? There are three random
> accesses that follow soon after reading an index tuple:
> BufTableLookup(), then page header, then the tuple itself. At the
> logical extreme, we could anticipate the probe of SharedBufHash for
> the TID from index tuple i + 3 (perhaps dynahash could expose a
> hash_prefetch_for_hash() facility), and anticipate the read of the
> page header for the tuple pointed to by index tuple i + 2 (perhaps a
> dirty read of SharedBufHash that is only prepared to look at the first
> entry in the bucket would be OK for this), and anticipate the read of
> the tuple for the TID from index tuple i + 1 (perhaps a dirty read of
> the page item table). Or some less ambitious subset, if that pipeline
> turns out to be too deep; it seems like the simplest experiment would
> be to try prefetching just SharedBufHash lookups. This may all be
> completely crazy, I haven't tried any of it.

This idea also looks good to me.  One thing bothers me: if we do
__builtin_prefetch(&a[b[c[d]]], 0, 1), and a, b, c and d are all not in
cache, will we incur CPU wait time for fetching a, b and c?  These folks
[0] are exploiting C++ coroutines for prefetching this kind of stuff,
but it seems like too much hassle.

>> Also, that research is quite distant from B-tree binsearch: their data
>> items are small and do not represent references to actual data.
>> Eytzinger exploits the fact that two possible future keys share the
>> same cache line. But that is hard to provide, given that we place
>> items at quite random places on the page.
>
> Based on the above quotes, I'm not suggesting we should use Eytzinger
> for search-within-page. But out of curiosity, I checked, and saw that
> you can actually get a pair of index tuples holding a six-character
> text key or an integer key into a 64 byte cache line.

Well, this proves that in theory Eytzinger is an opportunity.

Best regards, Andrey Borodin.

[0] http://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf
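For reference, the Eytzinger layout being debated can be built from a sorted array with a short recursion. This is a toy Python sketch of my own (not from any patch): element i has its children at 2i+1 and 2i+2, which is why the two possible next comparison keys sit adjacent in memory and can share a cache line.

```python
def eytzinger(sorted_items):
    """Lay out a sorted list in Eytzinger (BFS / implicit-heap) order."""
    out = [None] * len(sorted_items)
    it = iter(sorted_items)

    def fill(i):
        # In-order traversal of the implicit tree consumes sorted values.
        if i < len(out):
            fill(2 * i + 1)
            out[i] = next(it)
            fill(2 * i + 2)

    fill(0)
    return out

def eytzinger_search(a, key):
    """Search the Eytzinger layout; returns True if key is present."""
    i = 0
    while i < len(a):
        if a[i] == key:
            return True
        i = 2 * i + 1 if key < a[i] else 2 * i + 2
    return False
```

The sketch also makes the insertion objection above tangible: adding one key shifts the in-order positions of everything after it, so the whole array has to be rebuilt, unlike an ordinary sorted item array.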
Re: PostgreSQL 11 RC1 + GA Dates
Hi,

On Tue, Oct 2, 2018 at 7:28 PM Jonathan S. Katz wrote:
> Hi,
>
> Based on the current status of the open items and where we are at in the
> release cycle, the date for the first release candidate of PostgreSQL 11
> will be 2018-10-11.
>
> If all goes well with RC1, the PostgreSQL 11.0 GA release will be
> 2018-10-18. This is subject to change if we find any issues during the
> RC1 period that indicate we need to make an additional release candidate
> prior to going GA.

So, is the 11.0 GA release still planned for this week?

> To the entire community, thank you for all of your hard work on
> developing PostgreSQL 11. It is exciting that we are finally at this
> point where we are ready to make our major release.
>
> Jonathan

--
Sandeep Thakkar
Re: TupleTableSlot abstraction
On Sat, 13 Oct 2018 at 04:02, Andres Freund wrote:
>>>> @@ -2024,7 +2024,18 @@ FormIndexDatum(IndexInfo *indexInfo,
>>>>             Datum iDatum;
>>>>             bool isNull;
>>>>
>>>> -           if (keycol != 0)
>>>> +           if (keycol < 0)
>>>> +           {
>>>> +                   HeapTupleTableSlot *hslot = (HeapTupleTableSlot *) slot;
>>>> +
>>>> +                   /* Only heap tuples have system attributes. */
>>>> +                   Assert(TTS_IS_HEAPTUPLE(slot) || TTS_IS_BUFFERTUPLE(slot));
>>>> +
>>>> +                   iDatum = heap_getsysattr(hslot->tuple, keycol,
>>>> +                                            slot->tts_tupleDescriptor,
>>>> +                                            &isNull);
>>>> +           }
>>>> +           else if (keycol != 0)
>>>>             {
>>>>                     /*
>>>>                      * Plain index column; get the value we need directly from the
>>>
>>> This now should access the system column via the slot, right? There's
>>> other places like this IIRC.
>>
>> Done. In FormIndexDatum() and ExecInterpExpr(), directly calling
>> slot_getsysattr() now.
>>
>> In ExecInterpExpr(), I am no longer using ExecEvalSysVar() now. I am
>> planning to remove this definition since it would be a single-line
>> function just calling slot_getsysattr().
>>
>> In build_ExecEvalSysVar(), ExecEvalSysVar() is still used, so I
>> haven't removed the definition yet. I am planning to create a new
>> LLVMValueRef FuncSlotGetsysattr, and use that instead, in
>> build_ExecEvalSysVar(), or for that matter, I am thinking to revert
>> back build_ExecEvalSysVar() and instead have that code inline as in
>> HEAD.
>
> I'd just have ExecInterpExpr() continue calling ExecEvalSysVar. There's
> no reason for it to be inline.

Can you explain more why you think there should be an ExecEvalSysVar()
definition?  As I mentioned earlier, it would just call
slot_getsysattr() and do nothing else.

> And it's simpler for JIT than the alternative.

You mean it would be simpler for JIT to call ExecEvalSysVar() than
slot_getsysattr()?  I didn't get why it is simpler.
Or are you considering build_ExecEvalSysVar()?  I am ok with retaining
build_ExecEvalSysVar(), but I was saying that even inside this function,
we could do:

    LLVMBuildCall(..., llvm_get_decl(mod, FuncSlotGetsysattr), ...)

rather than:

    LLVMFunctionType(...)
    LLVMAddFunction("ExecEvalSysVar", ...)
    LLVMBuildCall(...)

>>>> @@ -185,6 +1020,7 @@ ExecResetTupleTable(List *tupleTable,  /* tuple table */
>>>>  {
>>>>         TupleTableSlot *slot = lfirst_node(TupleTableSlot, lc);
>>>>
>>>> +       slot->tts_cb->release(slot);
>>>>         /* Always release resources and reset the slot to empty */
>>>>         ExecClearTuple(slot);
>>>>         if (slot->tts_tupleDescriptor)
>>>> @@ -240,6 +1076,7 @@ void
>>>>  ExecDropSingleTupleTableSlot(TupleTableSlot *slot)
>>>>  {
>>>>         /* This should match ExecResetTupleTable's processing of one slot */
>>>> +       slot->tts_cb->release(slot);
>>>>         Assert(IsA(slot, TupleTableSlot));
>>>>         ExecClearTuple(slot);
>>>>         if (slot->tts_tupleDescriptor)
>>>
>>> ISTM that release should be called *after* clearing the slot.
>>
>> I am copying here what I discussed about this in the earlier reply:
>>
>> I am not sure what release() was designed to do. Currently all of the
>> implementers of this function are empty.
>
> So additional deallocations can happen in a slot. We might need this
> e.g. at some point for zheap which needs larger, longer-lived, buffers
> in slots.
>
>> Was it meant for doing ReleaseTupleDesc(slot->tts_tupleDescriptor)? Or
>> ReleaseBuffer(bslot->buffer)?
>
> No. The former lives in generic code, the latter is in ClearTuple.
>
>> I think the purpose of keeping this *before* clearing the tuple might
>> be because the clear() might have already cleared some handles that
>> release() might need.
>
> It's just plainly wrong to call it this way round.

Ok.

>> I went ahead and did these changes, but for now, I haven't replaced
>> ExecFetchSlotTuple() with ExecFetchSlotHeapTuple(). Instead, I
>> retained ExecFetchSlotTuple() to be called for heap tuples, and added
>> a new ExecFetchGenericSlotTuple() to be used with shouldFree. I don't
>> like these names, but until we have concluded, I don't want to go
>> ahead and replace all the numerous ExecFetchSlotTuple() calls with
>> ExecFetchSlotHeapTuple().
>
> Why not?

I haven't gone ahead because I wanted to know if you are ok with the
names.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
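The ordering point settled in this exchange - clear the slot first, then call release() - can be mocked up in a few lines. This is a toy Python sketch of the callback idea with invented names (the real tts_cb callbacks are C function pointers, and release() is currently empty in all implementations):

```python
class ToySlotOps:
    """Stand-in for a slot's callback table (tts_cb in the patch)."""

    def clear(self, slot):
        # Analogue of ExecClearTuple(): drop the stored tuple contents.
        slot["tuple"] = None

    def release(self, slot):
        # Analogue of release(): free longer-lived slot-private resources
        # (e.g. the larger buffers a zheap-style slot might hold).
        slot["private_buffer"] = None

def drop_slot(slot):
    # Mirrors the agreed ordering: clear the slot, then release it.
    slot["ops"].clear(slot)
    slot["ops"].release(slot)
    return slot
```

Calling release() before clear() would risk freeing resources that clear() still needs, which is the objection raised above.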