RE: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows
> On Tue, Jun 12, 2018 at 1:09 PM, Tsunakawa, Takayuki wrote:
> > My colleague is now preparing a patch for that, which adds a function
> > ECPGFreeSQLDA() in libecpg.so. That thread is here:
> > https://www.postgresql.org/message-id/25C1C6B2E7BE044889E4FE8643A58BA963A42097@G01JPEXMBKW03
>
> Thanks. I will follow up on that thread.

He's created a separate thread for a new CF entry here: https://commitfest.postgresql.org/18/1669/

> > I want some remedy for older releases. Our customer worked around this
> > problem by getting a libpq connection in their ECPG application and calling
> > PQfreemem(). That's an ugly kludge, and I don't want other users to follow it.
> >
> > I don't see a problem with back-patching as-is, because existing users
> > who just call free() or don't call free() won't be affected. I think that
> > most serious applications somehow state their supported minor releases like
> > "this application supports (or is certified against) PostgreSQL 10.5 or later",
> > just like other applications support "RHEL 6.2 or later" or "Windows XP SP2 or later."
>
> If there is a consensus that we should do that then I'll back-patch,
> but so far no one else has spoken up in support.

I'll follow the community decision. But I'm afraid that not enough people will comment on this to call it a consensus, because this topic will not be interesting... FWIW, I thought back-patching would make committers' future burden smaller thanks to the smaller difference in the code across multiple major versions.

Regards,
Takayuki Tsunakawa
Re: Performance regression with PostgreSQL 11 and partitioning
On 12 June 2018 at 01:49, Robert Haas wrote:
> On Fri, Jun 8, 2018 at 3:08 PM, Tom Lane wrote:
>> Robert Haas writes:
>>> That being said, I don't mind a bit if you want to look for further
>>> speedups here, but if you do, keep in mind that a lot of queries won't
>>> even use partition-wise join, so all of the arrays will be of length 1.
>>> Even when partition-wise join is used, it is quite likely not to be
>>> used for every table in the query, in which case it will still be of
>>> length 1 in some cases. So pessimizing nappinfos = 1 even slightly is
>>> probably a regression overall.
>>
>> TBH, I am way more concerned about the pessimization introduced for
>> every pre-existing usage of these functions by putting search loops
>> into them at all. I'd like very much to revert that. If we can
>> replace those with something along the line of root->index_array[varno]
>> we'll be better off across the board.
>
> I think David's analysis is correct -- this doesn't quite work. We're
> trying to identify whether a given varno is one of the ones to be
> translated, and it's hard to come up with a faster solution than
> iterating over a (very short) array of those values. One thing we
> could do is have two versions of each function, or else an optimized
> path for the very common nappinfos=1 case. I'm just not sure it would
> be worthwhile. Traversing a short array isn't free, but it's pretty
> cheap.

So this is the current situation as far as I see it:

We could go and add a new version of adjust_appendrel_attrs() and adjust_appendrel_attrs_mutator() that accept a Relids for the child rels rather than an array of AppendRelInfos. However, that's quite a lot of code duplication. We could perhaps cut down on duplication by having a callback function stored inside of adjust_appendrel_attrs_context which searches for the correct AppendRelInfo to use. However, it does not seem to be in line with simplifying the code.
We've not yet heard back from Tom with more details about his root->index_array[varno] idea. I can't quite see how this is possible and for the moment I assume Tom misunderstood that it's the parent varno that's known, not the varno of the child.

I've attached a patch which cleans up my earlier version and moves the setup of the append_rel_array into its own function instead of sneaking code into setup_simple_rel_arrays(). I've also now updated the comment above find_childrel_appendrelinfo(), which is now an unused function.

I tried the following test case:

CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION BY RANGE (date);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || ' hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text || ' hours')::interval || ''');' from generate_Series(0,) x;
\gexec
\o

SELECT * FROM partbench WHERE date = '2018-04-23 00:00:00';

Patched
Time: 190.989 ms
Time: 187.313 ms
Time: 190.239 ms

Unpatched
Time: 775.771 ms
Time: 781.631 ms
Time: 762.777 ms

Is this good enough for v11?

-- 
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

v3-0001-Allow-direct-lookups-of-AppendRelInfo-by-child-re.patch
Description: Binary data
Re: AtEOXact_ApplyLauncher() and subtransactions
On 16 June 2018 at 00:03, Amit Khandekar wrote: > The way I am implementing this can be seen in attached > apply_launcher_subtrans_WIP.patch. (check launcher.c changes). I > haven't started testing it yet though. Attached patch passes some basic testing I did. Will do some more testing, and some self-review and code organising, if required. I will also split the patch into two : one containing the main issue regarding subtransaction, and the other containing the other issue I mentioned earlier that shows up without subtransaction as well. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company apply_launcher_subtrans.patch Description: Binary data
Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"
On 16 June 2018 at 19:30, Amit Kapila wrote: > On Sat, Jun 16, 2018 at 10:44 AM, Amit Kapila wrote: >> Yeah, or perhaps disallow creation of any partial paths at the first >> place like in attached. This will save us some work as well. >> > > Attached patch contains test case as well. I have tried to come up > with some simple test, but couldn't come up with anything much simpler > than reported by Rajkumar, so decided to use the test case provided by > him. > Thanks for the patch. I had a look at it, and it looks good to me. One minor comment : +-- Parallel Append is not be used when the subpath depends on the outer param "is not be used" => "is not to be used" -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"
On 16 June 2018 at 10:44, Amit Kapila wrote:
> On Thu, Jun 14, 2018 at 10:05 PM, Tom Lane wrote:
>> I wrote:
>>> This appears to be the fault of commit ab7271677, whose authors I've cc'd:
>>> the stanza starting at about allpaths.c:1672 is bullheadedly creating a
>>> parallel path whether that's allowed or not. Fixing it might be as simple
>>> as adding "rel->consider_parallel" to the conditions there.
>>
>> Oh, and while I'm bitching: the same commit has created this exceedingly
>> dangerous coding pattern in create_append_path:
>>
>> pathnode->subpaths = list_concat(subpaths, partial_subpaths);
>>
>> foreach(l, subpaths)
>> {
>>     Path *subpath = (Path *) lfirst(l);
>>
>> Because list_concat() modifies its first argument, this will usually
>> have the effect that the foreach traverses the partial_subpaths as
>> well as the main subpaths. But it's very unclear to the reader whether
>> that's intended or not. Worse, if we had *only* partial subpaths so
>> that subpaths was initially NIL, then the loop would fail to traverse
>> the partial subpaths, resulting in inconsistent behavior.
>>
>> It looks to me like traversal of the partial subpaths is the right
>> thing here, in which case we should do
>>
>> - foreach(l, subpaths)
>> + foreach(l, pathnode->subpaths)
>>
>> or perhaps better
>>
>> - pathnode->subpaths = list_concat(subpaths, partial_subpaths);
>> + pathnode->subpaths = subpaths = list_concat(subpaths, partial_subpaths);
>>
>> to make the behavior clear and consistent.
>
> I agree with your analysis and proposed change. However, I think in
> practice, it might not lead to any bug as in the loop, we are
> computing parallel_safety and partial_subpaths should be
> parallel_safe.

Will have a look at this soon.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: Performance regression with PostgreSQL 11 and partitioning
Le 18/06/2018 à 10:46, David Rowley a écrit : > On 12 June 2018 at 01:49, Robert Haas wrote: >> On Fri, Jun 8, 2018 at 3:08 PM, Tom Lane wrote: >>> Robert Haas writes: That being said, I don't mind a bit if you want to look for further speedups here, but if you do, keep in mind that a lot of queries won't even use partition-wise join, so all of the arrays will be of length 1. Even when partition-wise join is used, it is quite likely not to be used for every table in the query, in which case it will still be of length 1 in some cases. So pessimizing nappinfos = 1 even slightly is probably a regression overall. >>> >>> TBH, I am way more concerned about the pessimization introduced for >>> every pre-existing usage of these functions by putting search loops >>> into them at all. I'd like very much to revert that. If we can >>> replace those with something along the line of root->index_array[varno] >>> we'll be better off across the board. >> >> I think David's analysis is correct -- this doesn't quite work. We're >> trying to identify whether a given varno is one of the ones to be >> translated, and it's hard to come up with a faster solution than >> iterating over a (very short) array of those values. One thing we >> could do is have two versions of each function, or else an optimized >> path for the very common nappinfos=1 case. I'm just not sure it would >> be worthwhile. Traversing a short array isn't free, but it's pretty >> cheap. > > So this is the current situation as far as I see it: > > We could go and add a new version of adjust_appendrel_attrs() and > adjust_appendrel_attrs_mutator() that accept a Relids for the child > rels rather than an array of AppendRelInfos. However, that's quite a > lot of code duplication. We could perhaps cut down on duplication by > having a callback function stored inside of > adjust_appendrel_attrs_context which searches for the correct > AppendRelInfo to use. However, it does not seem to be inline with > simplifying the code. 
>
> We've not yet heard back from Tom with more details about his
> root->index_array[varno] idea. I can't quite see how this is possible
> and for the moment I assume Tom misunderstood that it's the parent
> varno that's known, not the varno of the child.
>
> I've attached a patch which cleans up my earlier version and moves the
> setup of the append_rel_array into its own function instead of
> sneaking code into setup_simple_rel_arrays(). I've also now updated
> the comment above find_childrel_appendrelinfo(), which is now an
> unused function.
>
> I tried the following test case:
>
> CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
> INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
> PARTITION BY RANGE (date);
> \o /dev/null
> select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
> FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
> hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
> || ' hours')::interval || ''');'
> from generate_Series(0,) x;
> \gexec
> \o
>
> SELECT * FROM partbench WHERE date = '2018-04-23 00:00:00';
>
> Patched
>
> Time: 190.989 ms
> Time: 187.313 ms
> Time: 190.239 ms
>
> Unpatched
>
> Time: 775.771 ms
> Time: 781.631 ms
> Time: 762.777 ms
>
> Is this good enough for v11?

It works pretty well with your last patch. IMHO, this issue should be addressed in v11. I used a pretty unrealistic test case to show this regression, but it appears with a reasonable number of partitions: v11 is slower than v10 even with 10 partitions.

Thanks a lot!
Runtime partition pruning for MergeAppend
Back in the v11 cycle, there was just not quite enough time to get the MergeAppend run-time partition pruning patch in.

I've attached v24 of this patch. The only changes from v23 [1] are to re-base the patch atop current master. There was a bit of churn in the partition pruning and run-time pruning code last week. I've re-aligned the new code to account for the work done there.

I'll add this to the July 'fest.

[1] https://www.postgresql.org/message-id/cakjs1f_yfc8vvld3mhp4pxhrgprkb9hqa4zv+5v5pko6f03...@mail.gmail.com

-- 
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

v24-0001-Expand-run-time-partition-pruning-to-work-with-M.patch
Description: Binary data
Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.
Hi,

The test case below crashed when enable_partitionwise_aggregate is set to true.

CREATE TABLE part (c1 INTEGER,c2 INTEGER,c3 CHAR(10)) PARTITION BY RANGE(c1);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (MINVALUE) TO (500);
CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (500) TO (1000);
CREATE TABLE part_p3 PARTITION OF part FOR VALUES FROM (1000) TO (MAXVALUE);
INSERT INTO part SELECT i,i % 250, to_char(i % 4, 'FM') FROM GENERATE_SERIES(1,1500,2)i;
ANALYSE part;

ALTER TABLE part_p1 SET (parallel_workers = 0);
ALTER TABLE part_p2 SET (parallel_workers = 0);
ALTER TABLE part_p3 SET (parallel_workers = 0);

SET enable_partitionwise_join to on;

set enable_partitionwise_aggregate to off;
EXPLAIN (COSTS OFF)
SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;

set enable_partitionwise_aggregate to on;
EXPLAIN (COSTS OFF)
SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;

/*
postgres=# set enable_partitionwise_aggregate to off;
SET
postgres=# EXPLAIN (COSTS OFF)
postgres-# SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;
                       QUERY PLAN
 Sort
   Sort Key: (avg(t2.c1)), (sum(t1.c1))
   ->  HashAggregate
         Group Key: t1.c1, t2.c1
         Filter: ((sum(t1.c1) % '125'::bigint) = 0)
         ->  Append
               ->  Hash Join
                     Hash Cond: (t1.c1 = t2.c1)
                     ->  Seq Scan on part_p1 t1
                     ->  Hash
                           ->  Seq Scan on part_p1 t2
               ->  Hash Join
                     Hash Cond: (t1_1.c1 = t2_1.c1)
                     ->  Seq Scan on part_p2 t1_1
                     ->  Hash
                           ->  Seq Scan on part_p2 t2_1
               ->  Hash Join
                     Hash Cond: (t1_2.c1 = t2_2.c1)
                     ->  Seq Scan on part_p3 t1_2
                     ->  Hash
                           ->  Seq Scan on part_p3 t2_2
(21 rows)

postgres=# set enable_partitionwise_aggregate to on;
SET
postgres=# EXPLAIN (COSTS OFF)
postgres-# SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q
*/

--logfile
TRAP: FailedAssertion("!(parallel_workers > 0)", File: "allpaths.c", Line: 1630)
2018-06-14 23:24:58.375 IST [69650] LOG: server process (PID 69660) was terminated by signal 6: Aborted
2018-06-14 23:24:58.375 IST [69650] DETAIL: Failed process was running: EXPLAIN (COSTS OFF)
    SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;

--core.file
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `postgres: edb postgres [local] EXPLAIN '.
Program terminated with signal 6, Aborted.
#0  0x003dd2632495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x003dd2632495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x003dd2633c75 in abort () at abort.c:92
#2  0x00a326da in ExceptionalCondition (conditionName=0xc1a970 "!(parallel_workers > 0)", errorType=0xc1a426 "FailedAssertion", fileName=0xc1a476 "allpaths.c", lineNumber=1630) at assert.c:54
#3  0x00797bda in add_paths_to_append_rel (root=0x1d6ff08, rel=0x1d45d80, live_childrels=0x0) at allpaths.c:1630
#4  0x007d37e1 in create_partitionwise_grouping_paths (root=0x1d6ff08, input_rel=0x1da5380, grouped_rel=0x1d43520, partially_grouped_rel=0x1d45d80, agg_costs=0x7ffceb18dd20, gd=0x0, patype=PARTITIONWISE_AGGREGATE_FULL, extra=0x7ffceb18dbe0) at planner.c:7120
#5  0x007ce58d in create_ordinary_grouping_paths (root=0x1d6ff08, input_rel=0x1da5380, grouped_rel=0x1d43520, agg_costs=0x7ffceb18dd20, gd=0x0, extra=0x7ffceb18dbe0, partially_grouped_rel_p=0x7ffceb18dc70) at planner.c:4011
#6  0x007ce14b in create_grouping_paths (root=0x1d6ff08, input_rel=0x1da5380, target=0x1d446d0, target_parallel_safe=true, agg_costs=0x7ffceb18dd20, gd=0x0) at planner.c:3783
#7  0x007cb344 in grouping_planner (root=0x1d6ff08, inheritance_update=false, tuple_fraction=0) a
Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.
On Mon, Jun 18, 2018 at 5:02 PM, Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote: > Hi, > > Below test case crashed, when set enable_partitionwise_aggregate to true. > I will have a look over this. Thanks for reporting. > > CREATE TABLE part (c1 INTEGER,c2 INTEGER,c3 CHAR(10)) PARTITION BY > RANGE(c1); > CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (MINVALUE) TO (500); > CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (500) TO (1000); > CREATE TABLE part_p3 PARTITION OF part FOR VALUES FROM (1000) TO > (MAXVALUE); > INSERT INTO part SELECT i,i % 250, to_char(i % 4, 'FM') FROM > GENERATE_SERIES(1,1500,2)i; > ANALYSE part; > > ALTER TABLE part_p1 SET (parallel_workers = 0); > ALTER TABLE part_p2 SET (parallel_workers = 0); > ALTER TABLE part_p3 SET (parallel_workers = 0); > > SET enable_partitionwise_join to on; > > set enable_partitionwise_aggregate to off; > EXPLAIN (COSTS OFF) > SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = > t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2; > > set enable_partitionwise_aggregate to on; > EXPLAIN (COSTS OFF) > SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = > t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2; > > /* > postgres=# set enable_partitionwise_aggregate to off; > SET > postgres=# EXPLAIN (COSTS OFF) > postgres-# SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON > (t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY > 1,2; >QUERY PLAN > > Sort >Sort Key: (avg(t2.c1)), (sum(t1.c1)) >-> HashAggregate > Group Key: t1.c1, t2.c1 > Filter: ((sum(t1.c1) % '125'::bigint) = 0) > -> Append >-> Hash Join > Hash Cond: (t1.c1 = t2.c1) > -> Seq Scan on part_p1 t1 > -> Hash >-> Seq Scan on part_p1 t2 >-> Hash Join > Hash Cond: (t1_1.c1 = t2_1.c1) > -> Seq Scan on part_p2 t1_1 > -> Hash >-> Seq Scan on part_p2 t2_1 >-> Hash Join > Hash Cond: (t1_2.c1 = t2_2.c1) > -> Seq Scan 
on part_p3 t1_2 > -> Hash >-> Seq Scan on part_p3 t2_2 > (21 rows) > > postgres=# > postgres=# set enable_partitionwise_aggregate to on; > SET > postgres=# EXPLAIN (COSTS OFF) > postgres-# SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON > (t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY > 1,2; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > !> \q > */ > > --logfile > TRAP: FailedAssertion("!(parallel_workers > 0)", File: "allpaths.c", > Line: 1630) > 2018-06-14 23:24:58.375 IST [69650] LOG: server process (PID 69660) was > terminated by signal 6: Aborted > 2018-06-14 23:24:58.375 IST [69650] DETAIL: Failed process was running: > EXPLAIN (COSTS OFF) > SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON > (t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY > 1,2; > > > --core.file > Loaded symbols for /lib64/libnss_files.so.2 > Core was generated by `postgres: edb postgres [local] > EXPLAIN '. > Program terminated with signal 6, Aborted. 
> #0 0x003dd2632495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linu > x/raise.c:64 > 64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig); > Missing separate debuginfos, use: debuginfo-install > keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 > libcom_err-1.41.12-23.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 > openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64 > (gdb) bt > #0 0x003dd2632495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linu > x/raise.c:64 > #1 0x003dd2633c75 in abort () at abort.c:92 > #2 0x00a326da in ExceptionalCondition (conditionName=0xc1a970 > "!(parallel_workers > 0)", errorType=0xc1a426 "FailedAssertion", > fileName=0xc1a476 "allpaths.c", > lineNumber=1630) at assert.c:54 > #3 0x00797bda in add_paths_to_append_rel (root=0x1d6ff08, > rel=0x1d45d80, live_childrels=0x0) at allpaths.c:1630 > #4 0x007d37e1 in create_partitionwise_grouping_paths > (root=0x1d6ff08, input_rel=0x1da5380, grouped_rel=0x1d43520, > partially_grouped_rel=0x1d45d80, > agg_costs=0x7ffceb18dd20, gd=0x0, patype=PARTITIONWISE_AGGREGATE_FULL, > extra=0x7ffceb18dbe0) at planner.c:7120 > #5 0x007ce58d in create_ordinary_grouping_paths (root=0x1d6ff08, > input_rel=0x1da5380, grouped_rel=0x1d43520, agg_costs=0x7ffceb18dd20,
Re: Slow planning time for simple query
On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth wrote: >> "Tom" == Tom Lane writes: > > Tom> 2. Although _bt_killitems doesn't WAL-log its setting of kill > Tom> bits, those bits could propagate to the standby anyway, as a > Tom> result of a subsequent WAL action on the index page that gets a > Tom> full-page image added. > > That's OK as long as we're ignoring those hints on the standby. > What if we don't ignore those hints on standby for a specific case like the one in get_actual_variable_range? Now, if the user has enabled wal_log_hints on the master, it could save time from scanning many dead tuples on standby. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Concurrency bug in UPDATE of partition-key
On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar wrote: > On Mon, Jun 18, 2018 at 10:21 AM, Amit Khandekar > wrote: >> Attached is v2 version of the patch. It contains the above >> trigger-related issue fixed. >> >> The updated tuple is passed back using the existing newslot parameter >> of GetTupleForTrigger(). When ExecBRDeleteTriggers() is called using a >> new epqslot parameter, it means caller wants to skip the trigger >> execution, because the updated tuple needs to be again checked for >> constraints. I have added comments of this behaviour in the >> ExecBRDeleteTriggers() function header. >> > > Thanks for the updated patch. I have verified the BR trigger > behaviour, its working fine with the patch. > > + CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$ > + BEGIN > + RETURN OLD; > + END $$ LANGUAGE PLPGSQL; > + CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1 > + FOR EACH ROW EXECUTE PROCEDURE func_footrg(); > > Should we also create a test case where we can verify that some > unnecessary or duplicate triggers are not executed? > I am not sure how much value we will add by having such a test. In general, it is good to have tests that cover various aspects of functionality, but OTOH, we have to be careful to not overdo it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/14/2018 12:19 PM, Masahiko Sawada wrote: > On Wed, Jun 13, 2018 at 10:20 PM, Joe Conway wrote: >> The idea has not been extensively fleshed out yet, but the thought was >> that we create column level POLICY, which would transparently apply some >> kind of transform on input and/or output. The transforms would >> presumably be expressions, which in turn could use functions (extension >> or builtin) to do their work. That would allow encryption/decryption, >> DLP (data loss prevention) schemes (masking, redacting), etc. to be >> applied based on the policies. > > Which does this design encrypt data on, buffer or both buffer and > disk? The point of the design is simply to provide a mechanism for input and output transformation, not to provide the transform function itself. How you use that transformation would be entirely up to you, but if you were providing an encryption transform on input the data would be encrypted both buffer and disk. > And does this design (per-column encryption) aim to satisfy something > specific security compliance? Again, entirely up to you and dependent on what type of transformation you provide. If, for example you provided input encryption and output decryption based on some in memory session variable key, that would be essentially TDE and would satisfy several common sets of compliance requirements. >> This, in and of itself, would not address key management. There is >> probably a separate need for some kind of built in key management -- >> perhaps a flexible way to integrate with external systems such as Vault >> for example, or maybe something self contained, or perhaps both. > > I agree to have a flexible way in order to address different > requirements. I thought that having a GUC parameter to which we store > a shell command to get encryption key is enough but considering > integration with various key managements seamlessly I think that we > need to have APIs for key managements. 
> (fetching key, storing key, generating key etc)

I don't like the idea of yet another path for arbitrary shell code execution. An API for extension code would be preferable.

>> Or maybe key management is really tied into the separately discussed effort
>> to create SQL VARIABLEs somehow.
>
> Could you elaborate on how key management is tied into SQL VARIABLEs?

Well, the key management probably is not, but the SQL VARIABLE might be where the key is stored for use.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
Re: Possible bug in logical replication.
On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
> I confirm that starting reading WAL since restart_lsn as implemented in
> f731cfa fixes this issue, as well as the second issue tushar mentioned
> at [1].

Thanks!

+/*
+ * Start reading WAL at restart_lsn, which certainly points to the valid
+ * record.
+ */
 XLogRecPtr startlsn = MyReplicationSlot->data.restart_lsn;
 XLogRecPtr retlsn = MyReplicationSlot->data.confirmed_flush;

What does this one actually bring?

 PG_TRY();
 {
-    /* restart at slot's confirmed_flush */
+    /* start_lsn doesn't matter here, we don't replay xacts at all */
     ctx = CreateDecodingContext(InvalidXLogRecPtr, NIL, true,

Okay for this one.

-/*
- * The {begin_txn,change,commit_txn}_wrapper callbacks above will
- * store the description into our tuplestore.
- */
+/* Changes are not actually produced in fast_forward mode. */

This one is a good idea. Now CreateDecodingContext is missing the description of what fast_forward actually does, aka no changes are produced. Could you update your patch to reflect that? That would be useful for future callers of CreateDecodingContext as well.

-/* Stop once the moving point wanted by caller has been reached */
-if (moveto <= ctx->reader->EndRecPtr)
-    break;
-
 CHECK_FOR_INTERRUPTS();

It seems to me that we still want to have the slot forwarding finish in this case even if this is interrupted. Petr, isn't that the intention here?

-- 
Michael

signature.asc
Description: PGP signature
Re: ON CONFLICT DO NOTHING on pg_dump
On Sat, Jun 16, 2018 at 11:36 AM, Dilip Kumar wrote: > > @@ -172,6 +172,7 @@ typedef struct _dumpOptions > char*outputSuperuser; > > int sequence_data; /* dump sequence data even in schema-only mode */ > + int do_nothing; > } DumpOptions; > > The new structure member appears out of place, can you move up along > with other "command-line long options" ? > > Done regards Surafel pg_dump_onConflect_v3.pach Description: Binary data
Re: Concurrency bug in UPDATE of partition-key
On 18 June 2018 at 17:56, Amit Kapila wrote: > On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar wrote: >> Should we also create a test case where we can verify that some >> unnecessary or duplicate triggers are not executed? >> > > I am not sure how much value we will add by having such a test. In > general, it is good to have tests that cover various aspects of > functionality, but OTOH, we have to be careful to not overdo it. Actually I am thinking, it's not a big deal adding a RAISE statement in trigger function in the existing testcases. It will clearly show how many times the trigger has executed. So I will go ahead and do that. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On Mon, Jun 18, 2018 at 1:20 AM, Ashutosh Bapat wrote: > That's a wrong comparison. Inheritance based partitioning works even > after declarative partitioning feature is added. So, users can > continue using inheritance based partitioning if they don't want to > move to declarative partitioning. That's not true here. A user creates > a BEFORE ROW trigger on each partition that mimics partitioned table > level BEFORE ROW trigger. The proposed BEFORE ROW trigger on > partitioned table will cascade the trigger down to each partition. If > that fails to recognize that there is already an equivalent trigger, a > partition table will get two triggers doing the same thing silently > without any warning or notice. That's fine if the trigger changes the > salaries to $50K but not OK if each of those increases salary by 10%. > The database has limited ability to recognize what a trigger is doing. I don't even know what to say about that. You are correct that if a user creates a new trigger on a partitioned table and doesn't check what happens to any existing triggers that they already have, bad things might happen. But that seems like a pretty stupid thing to do. If you make *any* kind of critical change to your production database without testing it, bad things may happen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jun 13, 2018 at 9:20 AM, Joe Conway wrote: >> Also, if I understand correctly, at unconference session there also >> were two suggestions about the design other than the suggestion by >> Alexander: implementing TDE at column level using POLICY, and >> implementing TDE at table-space level. The former was suggested by Joe >> but I'm not sure the detail of that suggestion. I'd love to hear the >> deal of that suggestion. > > The idea has not been extensively fleshed out yet, but the thought was > that we create column level POLICY, which would transparently apply some > kind of transform on input and/or output. The transforms would > presumably be expressions, which in turn could use functions (extension > or builtin) to do their work. That would allow encryption/decryption, > DLP (data loss prevention) schemes (masking, redacting), etc. to be > applied based on the policies. It seems to me that column-level encryption is a lot less secure than block-level encryption. I am supposing here that the attack vector is stealing the disk. If all you've got is a bunch of 8192-byte blocks, it's unlikely you can infer much about the contents. You know the size of the relations and that's probably about it. If you've got individual values being encrypted, then there's more latitude to figure stuff out. You can infer something about the length of particular values. Perhaps you can find cases where the same encrypted value appears multiple times. If there's a btree index, you know the ordering of the values under whatever ordering semantics apply to that index. It's unclear to me how useful such information would be in practice or to what extent it might allow you to attack the underlying cryptography, but it seems like there might be cases where the information leakage is significant. For example, suppose you're trying to determine which partially-encrypted record is that of Aaron Aardvark... 
or this guy: https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff,_Sr. Recently, it was suggested to me that a use case for column-level encryption might be to prevent casual DBA snooping. So, you'd want the data to appear in pg_dump output encrypted, because the DBA might otherwise look at it, but you wouldn't really be concerned about the threat of the DBA loading a hostile C module that would steal user keys and use them to decrypt all the data, because they don't care that much and would be fired if they were caught doing it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
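Robert's point about value-level leakage can be made concrete with a toy sketch (this is NOT real cryptography and not anything proposed in the thread — the "cipher" is just a keyed SHA-256 keystream, chosen purely to make the leakage visible): with a deterministic per-value scheme, both equality and approximate length of the plaintext are observable from ciphertext alone.

```python
# Toy illustration (NOT real crypto): deterministic per-value encryption
# leaks equality and length, as described in the message above.
import hashlib

KEY = b"demo-key"  # illustrative key, not a real key-management scheme

def toy_encrypt_deterministic(plaintext: bytes) -> bytes:
    # Same plaintext + same key -> same ciphertext (no nonce, ECB-like).
    stream = hashlib.sha256(KEY).digest()
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(stream).digest()
    return bytes(p ^ s for p, s in zip(plaintext, stream))

a = toy_encrypt_deterministic(b"Aaron Aardvark")
b = toy_encrypt_deterministic(b"Aaron Aardvark")
c = toy_encrypt_deterministic(b"Bob")

# Equality leaks: identical values produce identical ciphertexts.
print(a == b)          # True
# Length leaks: ciphertext length tracks plaintext length exactly.
print(len(a), len(c))  # 14 3
```

An attacker who can see the column can therefore count repeated values and distinguish short names from long ones, which is exactly the information a bag of encrypted 8192-byte blocks would not reveal.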
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/18/2018 09:49 AM, Robert Haas wrote: > On Wed, Jun 13, 2018 at 9:20 AM, Joe Conway wrote: >>> Also, if I understand correctly, at unconference session there also >>> were two suggestions about the design other than the suggestion by >>> Alexander: implementing TDE at column level using POLICY, and >>> implementing TDE at table-space level. The former was suggested by Joe >>> but I'm not sure the detail of that suggestion. I'd love to hear the >>> deal of that suggestion. >> >> The idea has not been extensively fleshed out yet, but the thought was >> that we create column level POLICY, which would transparently apply some >> kind of transform on input and/or output. The transforms would >> presumably be expressions, which in turn could use functions (extension >> or builtin) to do their work. That would allow encryption/decryption, >> DLP (data loss prevention) schemes (masking, redacting), etc. to be >> applied based on the policies. > > It seems to me that column-level encryption is a lot less secure than > block-level encryption. I am supposing here that the attack vector is > stealing the disk. If all you've got is a bunch of 8192-byte blocks, > it's unlikely you can infer much about the contents. You know the > size of the relations and that's probably about it. Not necessarily. Our pages probably have enough predictable bytes to aid cryptanalysis, compared to user data in a column which might not be very predictable. > If you've got individual values being encrypted, then there's more > latitude to figure stuff out. You can infer something about the > length of particular values. Perhaps you can find cases where the > same encrypted value appears multiple times. This completely depends on the encryption scheme you are using, and the column level POLICY leaves that entirely up to you. But in any case most encryption schemes use a random nonce (salt) to ensure two identical strings do not encrypt to the same result. 
And often the encrypted length is padded, so while you might be able to infer short versus long, you would not usually be able to infer the exact plaintext length. > If there's a btree index, you know the ordering of the values under > whatever ordering semantics apply to that index. It's unclear to me > how useful such information would be in practice or to what extent it > might allow you to attack the underlying cryptography, but it seems > like there might be cases where the information leakage is > significant. For example, suppose you're trying to determine which > partially-encrypted record is that of Aaron Aardvark... or this guy: > https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff,_Sr. Again, this only applies if your POLICY uses this type of encryption, i.e. homomorphic encryption. If you use strong encryption you will not be indexing those columns at all, which is pretty commonly the case. > Recently, it was suggested to me that a use case for column-level > encryption might be to prevent casual DBA snooping. So, you'd want > the data to appear in pg_dump output encrypted, because the DBA might > otherwise look at it, but you wouldn't really be concerned about the > threat of the DBA loading a hostile C module that would steal user > keys and use them to decrypt all the data, because they don't care > that much and would be fired if they were caught doing it. Again completely dependent on the extension you use to do the encryption for the input policy. The keys don't need to be stored with the data, and the decryption can be transparent only for certain users or if certain session variables exist which the DBA does not have access to. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
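The two mitigations Joe mentions — a random nonce so equal plaintexts encrypt differently, and padding so only coarse length information leaks — can be sketched the same toy way (again NOT real cryptography; the keystream construction and block size are illustrative assumptions, not anything from the thread):

```python
# Toy sketch (NOT real crypto) of nonce + padding, per the message above.
import hashlib
import os

KEY = b"demo-key"  # illustrative
BLOCK = 16         # illustrative padding granularity

def toy_encrypt_randomized(plaintext: bytes) -> bytes:
    nonce = os.urandom(16)                               # fresh per value
    padded = plaintext + b"\x00" * (-len(plaintext) % BLOCK)
    stream = hashlib.sha256(KEY + nonce).digest()
    while len(stream) < len(padded):
        stream += hashlib.sha256(stream).digest()
    body = bytes(p ^ s for p, s in zip(padded, stream))
    return nonce + body                                  # store nonce alongside

a = toy_encrypt_randomized(b"Aaron Aardvark")
b = toy_encrypt_randomized(b"Aaron Aardvark")

print(a != b)                       # True: equality no longer observable
print(len(a) == len(b) == 16 + 16)  # True: both pad to one 16-byte block
```

With this shape an observer can still distinguish "one block" from "many blocks", but not exact lengths, and repeated values no longer collide — which is Joe's "short versus long" point. (The zero-padding here is not unambiguously reversible; a real scheme would use proper padding.)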
Re: Slow planning time for simple query
Amit Kapila writes: > On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth > wrote: >> That's OK as long as we're ignoring those hints on the standby. > What if we don't ignore those hints on standby for a specific case > like the one in get_actual_variable_range? Yeah, that's the same idea I suggested upthread. Andrew shot down my first thought (correctly I think) but the second one still seems feasible. > Now, if the user has > enabled wal_log_hints on the master, it could save time from scanning > many dead tuples on standby. We should have the standby set the hint bits for itself, rather than relying on wal_log_hints. regards, tom lane
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway wrote: > Not necessarily. Our pages probably have enough predictable bytes to aid > cryptanalysis, compared to user data in a column which might not be very > predicable. Really? I would guess that the amount of entropy in a page is WAY higher than in an individual column value. > But in any case most encryption schemes use a random nonce (salt) to > ensure two identical strings do not encrypt to the same result. And > often the encrypted length is padded, so while you might be able to > infer short versus long, you would not usually be able to infer the > exact plaintext length. Sure, that could be done, although it means that equality comparisons must be done unencrypted. > Again completely dependent on the extension you use to do the encryption > for the input policy. The keys don't need to be stored with the data, > and the decryption can be transparent only for certain users or if > certain session variables exist which the DBA does not have access to. Not arguing with that. And to be clear, I'm not trying to attack your proposal. I'm just trying to have a discussion about advantages and disadvantages of different approaches. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/18/2018 10:26 AM, Robert Haas wrote: > On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway wrote: >> Not necessarily. Our pages probably have enough predictable bytes to aid >> cryptanalysis, compared to user data in a column which might not be very >> predicable. > > Really? I would guess that the amount of entropy in a page is WAY > higher than in an individual column value. It isn't about the entropy of the page overall, it is about the predictability of specific bytes at specific locations on the pages. At least as far as I understand it. >> But in any case most encryption schemes use a random nonce (salt) to >> ensure two identical strings do not encrypt to the same result. And >> often the encrypted length is padded, so while you might be able to >> infer short versus long, you would not usually be able to infer the >> exact plaintext length. > > Sure, that could be done, although it means that equality comparisons > must be done unencrypted. Sure. Typically equality comparisons are done on other unencrypted attributes. Or if you need to do equality on encrypted columns, you can store non-reversible cryptographic hashes in a separate column. >> Again completely dependent on the extension you use to do the encryption >> for the input policy. The keys don't need to be stored with the data, >> and the decryption can be transparent only for certain users or if >> certain session variables exist which the DBA does not have access to. > > Not arguing with that. And to be clear, I'm not trying to attack your > proposal. I'm just trying to have a discussion about advantages and > disadvantages of different approaches. Understood. Ultimately we might want both page-level encryption and column level POLICY, as they are each useful for different use-cases. Personally I believe the former is more generally useful than the latter, but YMMV. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
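Joe's "non-reversible cryptographic hashes in a separate column" idea (sometimes called a blind index) can be sketched as follows. Using HMAC with a key separate from the encryption key — rather than a bare hash — is an assumption added here, so that an attacker who steals the storage cannot run a simple dictionary attack on low-entropy values; all names are illustrative:

```python
# Sketch of an equality-lookup column for encrypted data, per the message
# above: store HMAC(index_key, value) next to the ciphertext and compare
# on that column instead of decrypting.
import hashlib
import hmac

INDEX_KEY = b"separate-key-for-the-index-column"  # illustrative

def blind_index(value: str) -> str:
    return hmac.new(INDEX_KEY, value.encode(), hashlib.sha256).hexdigest()

# Stored row: (encrypted_payload, payload_idx = blind_index(plaintext))
# Equality lookup: WHERE payload_idx = blind_index('Aaron Aardvark')
print(blind_index("Aaron Aardvark") == blind_index("Aaron Aardvark"))  # True
print(blind_index("Aaron Aardvark") == blind_index("Bob"))             # False
```

This supports only equality, not range scans or ordering — consistent with Joe's point that strongly encrypted columns are typically not indexed for ordering at all.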
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Robert Haas writes: > On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway wrote: >> Not necessarily. Our pages probably have enough predictable bytes to aid >> cryptanalysis, compared to user data in a column which might not be very >> predicable. > Really? I would guess that the amount of entropy in a page is WAY > higher than in an individual column value. Depending on the specifics of the encryption scheme, having some amount of known (or guessable) plaintext may allow breaking the cipher, even if much of the plaintext is not known. This is cryptology 101, really. At the same time, having to have a bunch of independently-decipherable short field values is not real secure either, especially if they're known to all be encrypted with the same key. But what you know or can guess about the plaintext in such cases would be target-specific, rather than an attack that could be built once and used against any PG database. regards, tom lane
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan wrote: > > I think it would be helpful if you could talk more about these > > regressions (and the wins). > > I think that the performance regressions are due to the fact that when > you have a huge number of duplicates today, it's useful to be able to > claim space to fit further duplicates from almost any of the multiple > leaf pages that contain or have contained duplicates. I'd hoped that > the increased temporal locality that the patch gets would more than > make up for that. As far as I can tell, the problem is that temporal > locality doesn't help enough. I saw that performance was somewhat > improved with extreme Zipf distribution contention, but it went the > other way with less extreme contention. The details are not that fresh > in my mind, since I shelved this patch for a while following limited > performance testing. > > The code could certainly use more performance testing, and more > general polishing. I'm not strongly motivated to do that right now, > because I don't quite see a clear path to making this patch useful. > But, as I said, I have an open mind about what the next step should > be. Way back when I was dabbling in this kind of endeavor, my main idea to counteract that, and possibly improve performance overall, was a microvacuum kind of thing that would do some on-demand cleanup to remove duplicates or make room before page splits. Since nbtree uniqueification enables efficient retail deletions, that could end up as a net win. I never got around to implementing it though, and it does get tricky if you don't want to allow unbounded latency spikes.
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/18/2018 10:52 AM, Tom Lane wrote: > Robert Haas writes: >> On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway wrote: >>> Not necessarily. Our pages probably have enough predictable bytes to aid >>> cryptanalysis, compared to user data in a column which might not be very >>> predicable. > >> Really? I would guess that the amount of entropy in a page is WAY >> higher than in an individual column value. > > Depending on the specifics of the encryption scheme, having some amount > of known (or guessable) plaintext may allow breaking the cipher, even > if much of the plaintext is not known. This is cryptology 101, really. Exactly > At the same time, having to have a bunch of independently-decipherable > short field values is not real secure either, especially if they're known > to all be encrypted with the same key. But what you know or can guess > about the plaintext in such cases would be target-specific, rather than > an attack that could be built once and used against any PG database. Again is dependent on the specific solution for encryption. In some cases you might do something like generate a single use random key, encrypt the payload with that, encrypt the single use key with the "global" key, append the two results and store. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
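The scheme Joe describes at the end — a single-use random key per payload, wrapped under a "global" key, with the two results appended — is essentially envelope encryption. A toy sketch (NOT real cryptography; the XOR-keystream "cipher" and key sizes are illustrative stand-ins for a real cipher and KMS):

```python
# Toy sketch (NOT real crypto) of the envelope scheme described above:
# encrypt payload with a fresh data key, wrap the data key with the
# global key, store wrapped_key || ciphertext.
import hashlib
import os

GLOBAL_KEY = b"global-key"  # illustrative; a real system fetches this from a KMS

def keystream(key: bytes, n: int) -> bytes:
    out = hashlib.sha256(key).digest()
    while len(out) < n:
        out += hashlib.sha256(out).digest()
    return out[:n]

def xor(data: bytes, key: bytes) -> bytes:
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))

def envelope_encrypt(payload: bytes) -> bytes:
    data_key = os.urandom(32)                # single-use random key
    wrapped = xor(data_key, GLOBAL_KEY)      # "encrypt" data key w/ global key
    return wrapped + xor(payload, data_key)  # append the two results and store

def envelope_decrypt(blob: bytes) -> bytes:
    data_key = xor(blob[:32], GLOBAL_KEY)    # unwrap the data key first
    return xor(blob[32:], data_key)

msg = b"salary: 50000"
print(envelope_decrypt(envelope_encrypt(msg)) == msg)  # True
```

One practical benefit of this shape is key rotation: re-encrypting only the 32-byte wrapped keys under a new global key rewrites far less data than re-encrypting every payload. (Note the toy XOR wrap reuses the same keystream for every data key, which a real scheme must not do.)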
Index Skip Scan
Hi all,

I would like to start a discussion on Index Skip Scan referred to as Loose Index Scan in the wiki [1]. My use-case is the simplest form of Index Skip Scan (B-Tree only), namely going from

CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 1000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;

 HashAggregate (cost=169247.71..169247.74 rows=3 width=4) (actual time=4104.099..4104.099 rows=3 loops=1)
   Output: b
   Group Key: t1.b
   Buffers: shared hit=44248
   ->  Seq Scan on public.t1 (cost=0.00..144247.77 rows=977 width=4) (actual time=0.059..1050.376 rows=1000 loops=1)
         Output: a, b
         Buffers: shared hit=44248
 Planning Time: 0.157 ms
 Execution Time: 4104.155 ms
(9 rows)

to

CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 1000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;

 Index Skip Scan using idx_t1_b on public.t1 (cost=0.43..1.30 rows=3 width=4) (actual time=0.061..0.137 rows=3 loops=1)
   Output: b
   Heap Fetches: 3
   Buffers: shared hit=13
 Planning Time: 0.155 ms
 Execution Time: 0.170 ms
(6 rows)

I took Thomas Munro's previous patch [2] on the subject, added a GUC, a test case, documentation hooks, minor code cleanups, and made the patch pass an --enable-cassert make check-world run. So, the overall design is the same.

However, as Robert Haas noted in the thread there are issues with the patch as is, especially in relationship to the amcanbackward functionality.

A couple of questions to begin with. Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should a new node (T_IndexSkipScan) be created ? If latter, then there likely will be functionality that needs to be refactored into shared code between the nodes. Which is the best way to deal with the amcanbackward functionality ? 
Do people see another alternative to Robert's idea of adding a flag to the scan. I wasn't planning on making this a patch submission for the July CommitFest due to the reasons mentioned above, but can do so if people thinks it is best. The patch is based on master/4c8156. Any feedback, suggestions, design ideas and help with the patch in general is greatly appreciated. Thanks in advance ! [1] https://wiki.postgresql.org/wiki/Loose_indexscan [2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com Best regards, Jesper diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c index 6b2b9e3742..74ed15bfeb 100644 --- a/contrib/bloom/blutils.c +++ b/contrib/bloom/blutils.c @@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS) amroutine->ambulkdelete = blbulkdelete; amroutine->amvacuumcleanup = blvacuumcleanup; amroutine->amcanreturn = NULL; + amroutine->amskip = NULL; amroutine->amcostestimate = blcostestimate; amroutine->amoptions = bloptions; amroutine->amproperty = NULL; diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b60240ecfe..d03d64a4bc 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3769,6 +3769,22 @@ ANY num_sync ( + enable_indexskipscan (boolean) + + enable_indexskipscan configuration parameter + + + + +Enables or disables the query planner's use of index-skip-scan plan +types (see ). This parameter requires +that enable_indexonlyscan is on. +The default is on. 
+ + + + enable_material (boolean) diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 24c3405f91..bd6a2a7b93 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -135,6 +135,7 @@ typedef struct IndexAmRoutine amendscan_function amendscan; ammarkpos_function ammarkpos; /* can be NULL */ amrestrpos_function amrestrpos; /* can be NULL */ +amskip_function amskip; /* can be NULL */ /* interface functions to support parallel index scans */ amestimateparallelscan_function amestimateparallelscan;/* can be NULL */ @@ -664,6 +665,14 @@ amrestrpos (IndexScanDesc scan); +bool +amskip (IndexScanDesc scan, ScanDirection direction, int prefix); + + TODO + + + + Size amestimateparallelscan (void); diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml index 14a1aa56cb..5fd9f81f23 100644 --- a/doc/src/sgml/indices.sgml +++ b/doc/src/sgml/indices.sgml @@ -1312,6 +1312,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success; such cases and allow index-only scans to be generated, but older versions will not. + + +Index Skip Scans + + + index + index-skip scans + + + index-skip scan + + + + TODO + +
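For readers unfamiliar with the technique in the proposal above: a skip scan computes DISTINCT by repeatedly binary-searching for the first entry greater than the last value seen, instead of walking the whole index. A small sketch over a plain sorted Python list (illustrative only — the real patch does this inside nbtree):

```python
# Skip-scan sketch: with k distinct values among n sorted entries, this
# touches O(k log n) entries instead of scanning all n.
from bisect import bisect_right

def distinct_via_skip_scan(sorted_index):
    result, pos = [], 0
    while pos < len(sorted_index):
        value = sorted_index[pos]        # first occurrence of the next value
        result.append(value)
        pos = bisect_right(sorted_index, value, lo=pos)  # skip its duplicates
    return result

# Mimics the t1 example: b = i % 3 over many rows, only 3 distinct values.
index_on_b = sorted(i % 3 for i in range(1, 1001))
print(distinct_via_skip_scan(index_on_b))  # [0, 1, 2]
```

This is why the EXPLAIN output above drops from thousands of shared buffer hits to just a handful: only one descent per distinct value is needed.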
Re: ON CONFLICT DO NOTHING on pg_dump
On Fri, Jun 15, 2018 at 02:20:21AM +, Ideriha, Takeshi wrote: > >From: Nico Williams [mailto:n...@cryptonector.com] > >On Tue, Jun 12, 2018 at 09:05:23AM +, Ideriha, Takeshi wrote: > >> Only the difference of data can be restored. > > > >But that's additive-only. Only missing rows are restored this way, and > >differences are > >not addressed. > > > >If you want restore to restore data properly and concurrently (as opposed to > >renaming > >a new database into place or whatever) then you'd want a) MERGE, b) dump to > >generate MERGE statements. A concurrent data restore operation would be > >rather > >neat. > > I agree with you though supporting MERGE or ON-CONFLICT-DO-UPDATE seems hard > work. > Only ON-CONFLICT-DO-NOTHING use case may be narrow. Is it narrow, or is it just easy enough to add quickly? And by the way, you don't need MERGE. You can just generate INSERT/ UPDATE/DELETE statements -- MERGE is mainly an optimization on that, and could wait until PG has a MERGE. Nico --
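Nico's point that MERGE is not strictly required can be sketched like this: given old and new snapshots of a keyed table, plain INSERT / UPDATE / DELETE statements express the full difference. The table and column names here are illustrative, not anything from pg_dump:

```python
# Sketch: emit INSERT/UPDATE/DELETE statements for the difference between
# two snapshots of a table keyed by id, per the message above.
def diff_statements(table, old, new):
    """old/new: dict mapping primary key -> value column."""
    stmts = []
    for k in sorted(new.keys() - old.keys()):       # rows only in new
        stmts.append(f"INSERT INTO {table} VALUES ({k}, {new[k]!r}) "
                     f"ON CONFLICT DO NOTHING;")
    for k in sorted(new.keys() & old.keys()):       # rows in both, changed
        if new[k] != old[k]:
            stmts.append(f"UPDATE {table} SET val = {new[k]!r} WHERE id = {k};")
    for k in sorted(old.keys() - new.keys()):       # rows only in old
        stmts.append(f"DELETE FROM {table} WHERE id = {k};")
    return stmts

old = {1: "a", 2: "b", 3: "c"}
new = {1: "a", 2: "B", 4: "d"}
for s in diff_statements("t", old, new):
    print(s)
```

An ON-CONFLICT-DO-NOTHING-only dump corresponds to emitting just the first loop, which is why it is additive-only: updates and deletions in the source are never reflected in the restore.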
Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.
On 2018-06-18 17:10:12 +0530, Jeevan Chalke wrote: > On Mon, Jun 18, 2018 at 5:02 PM, Rajkumar Raghuwanshi < > rajkumar.raghuwan...@enterprisedb.com> wrote: > > > Hi, > > > > Below test case crashed, when set enable_partitionwise_aggregate to true. > > > > I will have a look over this. > > Thanks for reporting. I've added an v11 open-items entry. - Andres
Re: [PATCH] Find additional connection service files in pg_service.conf.d directory
Hello, On Thu, Mar 01, 2018 at 01:40:10PM -0500, Curt Tilmes wrote: > New patch limits files to ".conf". Isn't it worth checking errno for the stat(), opendir() and readdir() calls? I think that when some error occurs in searchServiceFileDirectory() the error "definition of service \"%s\" not found" will be raised, because group_found is false. That may be confusing, as it hides the real error. For example, if permission is denied to open the directory. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
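Arthur's concern, sketched in Python rather than the patch's C (directory path and function name are illustrative): a missing conf.d directory is benign, but a permission error should surface rather than collapse into "service not found".

```python
# Sketch: distinguish "directory absent" (fine) from real errors such as
# permission denied (must be reported), per the message above.
import errno
import os

def list_service_files(directory):
    try:
        names = os.listdir(directory)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return []   # no conf.d directory: simply nothing to add
        raise           # EACCES etc.: surface the real error to the caller
    return sorted(n for n in names if n.endswith(".conf"))

print(list_service_files("/nonexistent/pg_service.conf.d"))  # []
```

Swallowing every OSError in the except clause would reproduce exactly the bug described: a permission problem silently becomes "definition of service not found".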
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/18/2018 05:06 PM, Joe Conway wrote:
> On 06/18/2018 10:52 AM, Tom Lane wrote:
>> Robert Haas writes:
>>> On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway wrote:
>>>> Not necessarily. Our pages probably have enough predictable bytes to aid
>>>> cryptanalysis, compared to user data in a column which might not be very
>>>> predicable.
>>> Really? I would guess that the amount of entropy in a page is WAY
>>> higher than in an individual column value.
>> Depending on the specifics of the encryption scheme, having some amount
>> of known (or guessable) plaintext may allow breaking the cipher, even
>> if much of the plaintext is not known. This is cryptology 101, really.
> Exactly
>> At the same time, having to have a bunch of independently-decipherable
>> short field values is not real secure either, especially if they're known
>> to all be encrypted with the same key. But what you know or can guess
>> about the plaintext in such cases would be target-specific, rather than
>> an attack that could be built once and used against any PG database.
> Again is dependent on the specific solution for encryption. In some
> cases you might do something like generate a single use random key,
> encrypt the payload with that, encrypt the single use key with the
> "global" key, append the two results and store.

Yeah, I suppose we could even have per-page keys, for example.

One topic I haven't seen mentioned in this thread yet is indexes. That's a pretty significant side-channel, when built on encrypted columns. Even if the indexes are encrypted too, you can often deduce a lot of information from them.

So what's the plan here? Disallow indexes on encrypted columns? Index encrypted values directly? Something else?

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On 2018-Jun-18, Ashutosh Bapat wrote: > That's a wrong comparison. Inheritance based partitioning works even > after declarative partitioning feature is added. So, users can > continue using inheritance based partitioning if they don't want to > move to declarative partitioning. That's not true here. A user creates > a BEFORE ROW trigger on each partition that mimics partitioned table > level BEFORE ROW trigger. The proposed BEFORE ROW trigger on > partitioned table will cascade the trigger down to each partition. If > that fails to recognize that there is already an equivalent trigger, a > partition table will get two triggers doing the same thing silently > without any warning or notice. That's fine if the trigger changes the > salaries to $50K but not OK if each of those increases salary by 10%. > The database has limited ability to recognize what a trigger is doing. I agree with Robert that nobody in their right minds would be caught by this problem by adding new triggers without thinking about it and without testing them. If you add a partitioned-table-level trigger to raise salaries by 10% but there was already one in the partition level that does the same thing, you'll readily notice that salaries have been increased by 21% instead. So like Robert I'm inclined to not change the wording in the documentation. What does worry me a little bit now, reading this discussion, is whether we've made the triggers in partitions visible enough. We'll have this problem once we implement BEFORE ROW triggers as proposed, and I think we already have this problem now. 
Let's suppose a user creates a duplicated after trigger:

create table parent (a int) partition by range (a);
create table child partition of parent for values from (0) to (100);
create function noise() returns trigger language plpgsql as
  $$ begin raise notice 'nyaa'; return null; end; $$;
create trigger trig_p after insert on parent for each row execute procedure noise();
create trigger trig_c after insert on child for each row execute procedure noise();

Now let's try it:

alvherre=# insert into child values (1);
NOTICE: nyaa
NOTICE: nyaa
INSERT 0 1

OK, so where does that one come from?

alvherre=# \d child
              Tabla «public.child»
 Columna │  Tipo   │ Collation │ Nullable │ Default
─────────┼─────────┼───────────┼──────────┼─────────
 a       │ integer │           │          │
Partition of: parent FOR VALUES FROM (0) TO (100)
Triggers:
    trig_c AFTER INSERT ON child FOR EACH ROW EXECUTE PROCEDURE noise()

Hmm, there's only one trigger here, why does it appear twice? To find out, you have to know where to look:

alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger;
 tgname │ tgrelid │ tgisinternal
────────┼─────────┼──────────────
 trig_p │ parent  │ f
 trig_p │ child   │ t
 trig_c │ child   │ f
(3 filas)

So there is a trigger in table child, but it's hidden because tgisinternal. Of course, you can see it if you look at the parent's definition:

alvherre=# \d parent
              Tabla «public.parent»
 Columna │  Tipo   │ Collation │ Nullable │ Default
─────────┼─────────┼───────────┼──────────┼─────────
 a       │ integer │           │          │
Partition key: RANGE (a)
Triggers:
    trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise()
Number of partitions: 1 (Use \d+ to list them.)

I think it'd be useful to have a list of triggers that have been inherited from ancestors, or maybe simply a list of internal triggers. Or maybe this is not something to worry about?

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire wrote: > Way back when I was dabbling in this kind of endeavor, my main idea to > counteract that, and possibly improve performance overall, was a > microvacuum kind of thing that would do some on-demand cleanup to > remove duplicates or make room before page splits. Since nbtree > uniqueification enables efficient retail deletions, that could end up > as a net win. That sounds like a mechanism that works a bit like _bt_vacuum_one_page(), which we run at the last second before a page split. We do this to see if a page split that looks necessary can actually be avoided. I imagine that retail index tuple deletion (the whole point of this project) would be run by a VACUUM-like process that kills tuples that are dead to everyone. Even with something like zheap, you cannot just delete index tuples until you establish that they're truly dead. I guess that the delete marking stuff that Robert mentioned marks tuples as dead when the deleting transaction commits. Maybe we could justify having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if they're visible to anyone, and if not recycle), because we at least know that the deleting transaction committed there. That is, they could be recently dead or dead, and it may be worth going to the extra trouble of checking which when we know that it's one of the two possibilities. -- Peter Geoghegan
Re: why partition pruning doesn't work?
Amit Langote writes: > [ 0001-Open-partitioned-tables-during-Append-initialization.patch ] I took a look at this. While I'm in agreement with the general idea of holding open the partitioned relations' relcache entries throughout the query, I do not like anything about this patch: * It seems to be outright broken for the case that any of the partitioned relations are listed in nonleafResultRelations. If we're going to do it like this, we have to open every one of the partrels regardless of that. (I wonder whether we couldn't lose PlannedStmt.nonleafResultRelations altogether, in favor of merging the related code in InitPlan with this. That existing code is already a mighty ugly wart, and this patch makes it worse by adding new, related warts elsewhere.) * You've got *far* too much intimate knowledge of the possible callers in ExecOpenAppendPartitionedTables. Personally, what I would have this function do is return a List of the opened Relation pointers, and add a matching function to run through such a List and close the entries again, and make the callers responsible for stashing the List pointer in an appropriate field in their planstate. Or maybe what we should do is drop ExecLockNonLeafAppendTables/ ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it. regards, tom lane
Re: Removing "Included attributes in B-tree indexes" section from docs
On 2018-Jun-17, Peter Geoghegan wrote: > On Sat, Jun 16, 2018 at 8:51 PM, Alvaro Herrera > wrote: > > I don't necessarily object to the proposed change, but I think you > > should generally wait a bit longer for others to react. > > What wait period do you think is appropriate in this case? One which includes at least half a working day in a different timezone. You asked mid-afternoon on a Friday in a timezone pretty far west. I think you could find people in their offices in Easter Island, but not many more. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Removing "Included attributes in B-tree indexes" section from docs
On Mon, Jun 18, 2018 at 10:21 AM, Alvaro Herrera wrote: > One which includes at least half a working day in a different timezone. > You asked mid-afternoon on a Friday in a timezone pretty far west. It was 11 am PST. I'll make a note about this. It won't happen again. -- Peter Geoghegan
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jun 11, 2018 at 06:22:22PM +0900, Masahiko Sawada wrote: > As per discussion at PGCon unconference, I think that firstly we need > to discuss what threats we want to defend database data against. If We call that a threat model. There can be many threat models, of course. > user wants to defend against a threat that is malicious user who > logged in OS or database steals an important data on datbase this > design TDE would not help. Because such user can steal the data by > getting a memory dump or by SQL. That is of course differs depending > on system requirements or security compliance but what threats do you > want to defend database data against? and why? This design guards (somewhat) against the threat of storage theft (e.g., because the storage is remote). It's a fine threat model to address, but it's also a lot easier to address in the filesystem or device drivers -- there's no need to do this in PostgreSQL itself except so as to support it on all platforms regardless of OS capabilities. Note that unless the pg_catalog is protected against manipulation by remote storage, then TDE for user tables might be possible to compromise. Like so: the attacker manipulates the pg_catalog to escalate privilege in order to obtain the TDE keys. This argues for full database encryption, not just specific tables or columns. But again, this is for the threat model where the storage is the threat. Another similar threat model is dump management, where dumps are sent off-site where untrusted users might read them, or even edit them in the hopes that they will be used for restores and thus compromise the database. This is most easily addressed by just encrypting the backups externally to PG. Threat models where client users are the threat are easily handled by PG's permissions system. I think any threat model where DBAs are not the threat is just not that interesting to address with crypto within postgres itself... 
Encryption to public keys for which postgres does not have private keys would be one way to address DBAs-as-the-threat, but this is easily done with an extension... A small amount of syntactic sugar might help: CREATE ROLE "bar" WITH (PUBLIC KEY "..."); CREATE TABLE foo ( name TEXT PRIMARY KEY, payload TEXT ENCRYPTED TO ROLE "bar" BOUND TO name ); but this is just syntactic sugar, so not that valuable. On the other hand, just a bit of syntactic sugar can help tick a feature checkbox, which might be very valuable for marketing reasons even if it's not valuable for any other reason. Note that encrypting the payload without a binding to the PK (or similar name) is very dangerous! So the encryption option would have to support some way to indicate what other plaintext to bind in (here the "name" column). Note also that for key management reasons it would be necessary to be able to write the payload as ciphertext rather than as to-be-encrypted TEXT. Lastly, for a symmetric encryption option one would need a remote oracle to do the encryption, which seems rather complicated, but in some cases may well perform faster. Nico --
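Nico's warning about binding the payload to the PK can be sketched as follows (a toy, NOT real cryptography, and not the extension design he describes — it uses a bare HMAC over name-plus-payload purely to show why unbound ciphertexts are dangerous; in a real AEAD scheme the "name" column would be passed as associated data):

```python
# Toy sketch: include the row's "name" key in the MAC so a ciphertext
# cut-and-pasted onto another row fails verification, per the warning above.
import hashlib
import hmac

KEY = b"demo-key"  # illustrative

def protect(name: str, payload: bytes) -> bytes:
    tag = hmac.new(KEY, name.encode() + b"\x00" + payload,
                   hashlib.sha256).digest()
    return tag + payload   # toy: MAC-then-store; a real scheme also encrypts

def verify(name: str, blob: bytes) -> bytes:
    tag, payload = blob[:32], blob[32:]
    expect = hmac.new(KEY, name.encode() + b"\x00" + payload,
                      hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise ValueError("payload does not belong to this row")
    return payload

blob = protect("alice", b"salary=90000")
print(verify("alice", blob))   # b'salary=90000'
try:
    verify("bob", blob)        # same blob moved onto another row...
    print("swap undetected")
except ValueError:
    print("swap detected")     # ...is rejected
```

Without the binding, a malicious DBA could swap encrypted salary blobs between rows and the application would happily decrypt the wrong one — exactly the attack the BOUND TO clause in the sketched syntax is meant to prevent.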
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On Mon, Jun 18, 2018 at 9:59 AM, Alvaro Herrera wrote: > > alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger; > tgname │ tgrelid │ tgisinternal > ┼─┼── > trig_p │ parent │ f > trig_p │ child │ t > trig_c │ child │ f > (3 filas) > > So there is a trigger in table child, but it's hidden because > tgisinternal. Of course, you can see it if you look at the parent's > definition: > > alvherre=# \d parent >Tabla «public.parent» > Columna │ Tipo │ Collation │ Nullable │ Default > ─┼─┼───┼──┼─ > a │ integer │ │ │ > Partition key: RANGE (a) > Triggers: > trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise() > Number of partitions: 1 (Use \d+ to list them.) > > I think it'd be useful to have a list of triggers that have been > inherited from ancestors, or maybe simply a list of internal triggers > > Or maybe this is not something to worry about? For the main internal trigger, the foreign key, we don't show the trigger on the relevant table, but we do indicate its effect by showing the presence of the foreign key. We likewise need to show the effect of the inherited trigger on the child table. When viewing the output, the display order should match invocation order (is it that way now?) as a primary goal, with how we present the direct or inherited nature dependent upon that. i.e., I would like to see "Parent Triggers:" and "Triggers:" sections if possible, but if trig_c is going to be invoked before trig_p that won't work, and having a single "Triggers:" section with "(parent)" somewhere in the trigger info printout would be preferred. David J.
Re: Removing "Included attributes in B-tree indexes" section from docs
On 2018-06-18 13:21:43 -0400, Alvaro Herrera wrote: > On 2018-Jun-17, Peter Geoghegan wrote: > > > On Sat, Jun 16, 2018 at 8:51 PM, Alvaro Herrera > > wrote: > > > I don't necessarily object to the proposed change, but I think you > > > should generally wait a bit longer for others to react. > > > > What wait period do you think is appropriate in this case? > > One which includes at least half a working day in a different timezone. > You asked mid-afternoon on a Friday in a timezone pretty far west. I > think you could find people in their offices in Easter Island, but not > many more. I think there's also a question of how much a patch is blocking you / others. A shorter question period is more understandable if it's step 3/40, rather than 1/1... Greetings, Andres Freund
Re: Index Skip Scan
Hi! On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen wrote: > I would like to start a discussion on Index Skip Scan referred to as > Loose Index Scan in the wiki [1]. Great, I'm glad to see you working on this! > However, as Robert Haas noted in the thread there are issues with the > patch as is, especially in relationship to the amcanbackward functionality. > > A couple of questions to begin with. > > Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should > a new node (T_IndexSkipScan) be created ? If latter, then there likely > will be functionality that needs to be refactored into shared code > between the nodes. Is skip scan only possible for an index-only scan? I guess not. We could also make a plain index scan behave like a skip scan. And it should be useful for accelerating the DISTINCT ON clause. Thus, we might have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan, IndexOnlySkipScan. I don't like index scan nodes multiplying this way, and it would probably be better to keep the number of nodes smaller. But I don't insist on that, and I would like to hear other opinions too. > Which is the best way to deal with the amcanbackward functionality ? Do > people see another alternative to Robert's idea of adding a flag to the > scan. Supporting amcanbackward seems basically possible, but rather complicated and not very efficient. So, it seems not worth implementing, at least in the initial version. Then the question is how an index access method should report that it supports both skip scan and backward scan, but not both together. What about turning amcanbackward into a function which takes a (bool skipscan) argument? This function would then return whether backward scan is supported depending on whether skip scan is used. > I wasn't planning on making this a patch submission for the July > CommitFest due to the reasons mentioned above, but can do so if people > think it is best. The patch is based on master/4c8156.
Please register it on the commitfest. Even if there isn't enough time for this patch in the July commitfest, it's no problem to move it. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
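For intuition, here is a minimal Python sketch (not Postgres code; the names are invented for illustration) of what the loose/skip scan discussed above buys for DISTINCT: instead of stepping over every index entry, it binary-searches past each group of duplicates.

```python
from bisect import bisect_right

def distinct_prefixes(index_entries):
    """Loose index scan sketch: index_entries is a sorted list of
    (key, tid) pairs with numeric tids.  Collect each distinct key by
    skipping past the whole duplicate group with a binary search,
    rather than visiting entries one by one."""
    result = []
    pos = 0
    while pos < len(index_entries):
        key = index_entries[pos][0]
        result.append(key)
        # Jump to the first entry whose key is strictly greater; the
        # float("inf") sentinel sorts after any numeric tid.
        pos = bisect_right(index_entries, (key, float("inf")), pos)
    return result

entries = [(1, 10), (1, 11), (2, 12), (2, 13), (5, 14)]
distinct = distinct_prefixes(entries)
```

With k distinct values among N entries this touches O(k log N) entries instead of O(N), which is why the skipping matters more than whether the node is an index scan or an index-only scan.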
Re: Query Rewrite for Materialized Views (Postgres Extension)
> > Hope it is useful or interesting for someone! Questions or comments are >> very welcome. >> > > good idea. > > Regards > > Pavel > In a recent PgConf NYC presentation [1] I was talking about the technical hurdles to implementing materialized views that could be kept up to date at all times, and the benefits of having such a thing. Some use cases can be addressed with eventually-consistent derivative table structures (Vertica's projections, PipelineDB's continuous views, etc), but those methods rely on the source data never having deletes or updates, or confining those updates to the "hot" part of the source tables, so it generally works for time-series data, but not for other cases. It has occurred to me that Dave Fetter's work on ASSERTIONS [2] has common underpinnings with true continuous materialized views. In both cases, the creation of a system object causes the creations of insert/update/delete triggers on one or more existing tables. In the case of assertions, those triggers are run with the goal of raising an error if rows are returned from a query. In the case of a materialized view, those same triggers would be used to delete rows from a CMV and insert replacements rows. If we can get always-up-to-date materialized views, then Denty's work on query rewrite would have greatly enhanced utility. [1] https://postgresconf.org/conferences/2018/program/proposals/a-roadmap-to-continuous-materialized-views-b4644661-8d5a-4186-8c17-4fb82600e147 [2] http://databasedoings.blogspot.com/2018/06/ive-posted-my-slides-for-my-asssertions.html
Re: Removing "Included attributes in B-tree indexes" section from docs
On Mon, Jun 18, 2018 at 1:31 PM, Andres Freund wrote: > I think there's also a question of how much a patch is blocking you / > others. A shorter question period is more understandable if it's step > 3/40, rather than 1/1... Agreed. For non-critical stuff like this it seems like waiting 2 or 3 business days is a good idea. Doesn't sound like there's any actual controversy in this case, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan wrote: > > On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire > wrote: > > Way back when I was dabbling in this kind of endeavor, my main idea to > > counteract that, and possibly improve performance overall, was a > > microvacuum kind of thing that would do some on-demand cleanup to > > remove duplicates or make room before page splits. Since nbtree > > uniqueification enables efficient retail deletions, that could end up > > as a net win. > > That sounds like a mechanism that works a bit like > _bt_vacuum_one_page(), which we run at the last second before a page > split. We do this to see if a page split that looks necessary can > actually be avoided. > > I imagine that retail index tuple deletion (the whole point of this > project) would be run by a VACUUM-like process that kills tuples that > are dead to everyone. Even with something like zheap, you cannot just > delete index tuples until you establish that they're truly dead. I > guess that the delete marking stuff that Robert mentioned marks tuples > as dead when the deleting transaction commits. Maybe we could justify > having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if > they're visible to anyone, and if not recycle), because we at least > know that the deleting transaction committed there. That is, they > could be recently dead or dead, and it may be worth going to the extra > trouble of checking which when we know that it's one of the two > possibilities. Yes, but currently bt_vacuum_one_page does local work on the pinned page. Doing dead tuple deletion however involves reading the heap to check visibility at the very least, and doing it on the whole page might involve several heap fetches, so it's an order of magnitude heavier if done naively. But the idea is to do just that, only not naively.
Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.
On 2018-Jun-11, Antonin Houska wrote: > Arseny Sher wrote: > > Please see detailed description of the issues, tests which reproduce > > them and fixes in the attached patch. > > I've played with your tests and gdb for a while, both w/o and with your > patch. I think I can understand both problems. I could not invent a simpler way > to fix them. > > One comment about the coding: since you introduce a new transaction list and > it's sorted by LSN, I think you should make the function AssertTXNLsnOrder() > more generic and use it to check the by_base_snapshot_lsn list too. For > example call it after the list insertion and deletion in > ReorderBufferPassBaseSnapshot(). I've been going over this patch also, and I've made a few minor edits of the patch and the existing code as I come to understand what it's all about. > Also I think it does no harm if you split the diff into two, although both > fixes use the by_base_snapshot_lsn field. Please don't. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speedup of relation deletes during recovery
On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund wrote: > Hi, > > On 2018-06-15 10:45:04 -0700, Andres Freund wrote: >> > + >> > + srels = palloc(sizeof(SMgrRelation) * ndelrels); >> > for (i = 0; i < ndelrels; i++) >> > - { >> > - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId); >> > + srels[i] = smgropen(delrels[i], InvalidBackendId); >> > >> > - smgrdounlink(srel, false); >> > - smgrclose(srel); >> > - } >> > + smgrdounlinkall(srels, ndelrels, false); >> > + >> > + for (i = 0; i < ndelrels; i++) >> > + smgrclose(srels[i]); >> > + pfree(srels); > > Hm. This will probably cause another complexity issue. If we just > smgropen() the relation will be unowned. Which means smgrclose() will > call remove_from_unowned_list(), which is O(open-relations). Which means > this change afaict creates a new O(ndrels^2) behaviour? > > It seems like the easiest fix for that would be to have a local > SMgrRelationData in that loop, that we assign the relations to? That's > a bit hacky, but afaict should work? The easier (but tricky) fix for that is to call smgrclose() for each SMgrRelation in the reverse order. That is, we should do for (i = ndelrels - 1; i >= 0; i--) smgrclose(srels[i]); instead of for (i = 0; i < ndelrels; i++) smgrclose(srels[i]); Since add_to_unowned_list() always adds the entry at the head of the list, the SMgrRelation entries end up in the "unowned" list in reverse open order, i.e., an entry opened later is placed ahead of one opened earlier. So if we call remove_from_unowned_list() in that same reverse order, remove_from_unowned_list()'s search should be O(1). IOW, we can immediately find the SMgrRelation entry to remove at the head of the list. Regards, -- Fujii Masao
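To see why the reverse order helps, here is a toy Python model (not the smgr code itself) of a head-inserted list with linear-search removal, counting how many comparisons each close order costs:

```python
def close_all(unowned, order):
    """Count the search steps needed to remove entries from a list that
    was built by repeated head insertion ('unowned' is newest-first).
    index() models remove_from_unowned_list()'s linear scan."""
    steps = 0
    for item in order:
        idx = unowned.index(item)  # linear search from the head
        steps += idx + 1
        del unowned[idx]
    return steps

rels = list(range(5))             # smgropen() order: 0, 1, 2, 3, 4
unowned = list(reversed(rels))    # head insertion leaves newest first: 4, 3, 2, 1, 0

# Forward close order: each lookup scans the whole remaining list -> O(n^2).
fwd = close_all(list(unowned), rels)
# Reverse close order: the target is always at the head -> O(n) total.
rev = close_all(list(unowned), list(reversed(rels)))
```

Closing in reverse open order always finds the victim at the head of the list, so the quadratic search cost disappears.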
Re: Speedup of relation deletes during recovery
On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev wrote: >> We just had a customer hit this issue. I kind of wonder whether this >> shouldn't be backpatched: Currently the execution on the primary is >> O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the >> standby - with a lot higher constants to boot. That means it's very >> easy to get into situations where the standy starts to lag behind very >> significantly. > > +1, we faced with that too +1 to back-patch. As Horiguchi-san pointed out, this is basically the fix for oversight of commit 279628a0a7, not new feature. Regards, -- Fujii Masao
Re: Speedup of relation deletes during recovery
On 2018-06-19 03:06:54 +0900, Fujii Masao wrote: > On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund wrote: > > Hi, > > > > On 2018-06-15 10:45:04 -0700, Andres Freund wrote: > >> > + > >> > + srels = palloc(sizeof(SMgrRelation) * ndelrels); > >> > for (i = 0; i < ndelrels; i++) > >> > - { > >> > - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId); > >> > + srels[i] = smgropen(delrels[i], InvalidBackendId); > >> > > >> > - smgrdounlink(srel, false); > >> > - smgrclose(srel); > >> > - } > >> > + smgrdounlinkall(srels, ndelrels, false); > >> > + > >> > + for (i = 0; i < ndelrels; i++) > >> > + smgrclose(srels[i]); > >> > + pfree(srels); > > > > Hm. This will probably cause another complexity issue. If we just > > smgropen() the relation will be unowned. Which means smgrclose() will > > call remove_from_unowned_list(), which is O(open-relations). Which means > > this change afaict creates a new O(ndrels^2) behaviour? > > > > It seems like the easiest fix for that would be to have a local > > SMgrRelationData in that loop, that we assign the relations to? That's > > a bit hacky, but afaict should work? > > The easier (but tricky) fix for that is to call smgrclose() for > each SMgrRelation in the reverse order. That is, we should do > >for (i = ndelrels - 1; i >= 0; i--) >smgrclose(srels[i]); > > instead of > >for (i = 0; i < ndelrels; i++) >smgrclose(srels[i]); We could do that - but add_to_unowned_list() is actually a bottleneck in other places during recovery too. We pretty much never (outside of dropping relations / databases) close opened relations during recovery - which is obviously problematic since nearly all of the entries are unowned. I've come to the conclusion that we should have a global variable that just disables adding anything to the global lists. Greetings, Andres Freund
Re: Query Rewrite for Materialized Views (Postgres Extension)
I commented to Corey (privately) that, while my rewrite extension has gotten me a server that responds quickly to aggregate queries, the constant need to refresh the supporting MVs means the system’s load average is constant and much higher than before. I’m happy with the tradeoff for now, but it’s a huge waste of energy, and I’m sure it must thrash my disk. I’m very interested in what other people think of Corey’s idea.
Re: Transform for pl/perl
ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes: > [ 0001-Fix-excess-enreferencing-in-plperl-jsonb-transform.patch ] I tested this a bit more thoroughly by dint of applying Data::Dumper to the Perl values, and found that we were still getting extra references to sub-objects, for example INFO: $VAR1 = {'1' => \{'2' => \['3','4','5']},'2' => '3'}; where what we want is INFO: $VAR1 = {'1' => {'2' => ['3','4','5']},'2' => '3'}; That turns out to be because the newRV() call in JsonbValue_to_SV() is also superfluous, if we've set up refs around HV and AV scalars. Pushed with that change and the extra testing technology. I'll go push the dereferencing patch I proposed shortly, as well. The remaining unresolved issue in this thread is Ilmari's suggestion that we should convert integers to Perl IV (or UV?) if they fit, rather than always convert to NV as now. I'm inclined to reject that proposal, though, and not just because we don't have a patch for it. What's bothering me about it is that then the behavior would be dependent on the width of IV in the particular Perl installation. I think that is a platform dependency we can do without. regards, tom lane
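The platform-dependency concern is easy to demonstrate: Perl's NV is normally a C double, and Python's float is too, so this sketch shows the silent value change that an NV conversion introduces for integers wider than 53 bits (the same change an IV conversion would avoid only on 64-bit-IV builds):

```python
# Doubles have a 53-bit significand, so not every integer above 2**53
# is representable; converting a 64-bit integer to a double (Perl's NV)
# can silently alter its value.
big = 2**53 + 1                        # 9007199254740993
lost = float(big) == float(2**53)      # True: the +1 rounds away

# Integers up to 2**53 round-trip exactly through a double.
exact = int(float(2**53)) == 2**53     # True
```

This is the crux of the IV-vs-NV question: whether such integers survive depends on the width of IV on the particular Perl build, which is exactly the platform dependency Tom objects to.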
Re: Query Rewrite for Materialized Views (Postgres Extension)
On Mon, Jun 18, 2018 at 07:38:13PM +0100, Dent John wrote: > I commented to Corey (privately) that, while my rewrite extension has > gotten me a server that responds quickly to aggregate queries, the > constant need to refresh the supporting MVs means the system’s load > average is constant and much higher than before. I’m happy with the > tradeoff for now, but it’s a huge waste of energy, and I’m sure it > must thrash my disk. > > I’m very interested in what other people think of Corey’s idea. I've written an alternative materialization extension (entirely as PlPgSQL) based on PG's internals, but my version has a few big wins that might help here. I'm thinking of properly integrating it with PG. Some of the features include: - you can write triggers that update the materialization This is because the materialization is just a regular table in my implementation. - you can mark a view as needing a refresh (e.g., in a trigger) - you can declare a PK, other constraints, and indexes on a materialization The DMLs used to refresh a view concurrently can take advantage of the PK and/or other indexes to go fast. - you get a history table which records updates to the materialization This is useful for generating incremental updates to external systems. Keeping track of refresh times should help decide whether to use or not use a materialization in some query, or whether to refresh it first, or not use it at all. One of the things I'd eventually like to do is analyze the view query AST to automatically generate triggers to update materializations or mark them as needing refreshes. 
A first, very very rough sketch of such an analysis looks like this: - if the view query has CTEs -> create triggers on all its table sources to mark the materialization as needing a refresh - else if a table appears more than once as a table source in the view query -> create triggers on that table that mark the materialization as needing a refresh - else if a table appears anywhere other than the top-level -> create triggers .. mark as needing refresh - else if a table is a right-side of a left join -> create triggers .. mark as needing refresh - else if a table has no PK -> create triggers .. mark as needing refresh - else if the query has no GROUP BY, or only does a GROUP BY on this table and a list of columns prefixed by the table's PK -> rewrite the query to have WHERE eq conditions on values for the table's PK columns analyze this query if the result shows this table source as the first table in the plan -> create triggers on this table to update the materialization directly from querying the source view - else -> create triggers .. mark as needing refresh Nico --
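As a very rough illustration only, the rule list above could be mechanized as in this Python sketch; every key of the input dict is a hypothetical stand-in for a real check against the view query's AST:

```python
def materialization_strategy(view):
    """Mirror the rule list above: return 'refresh-mark' (create
    triggers that mark the materialization as needing a refresh) or
    'direct-update' (create triggers that update it in place).
    The dict keys are invented placeholders for real AST analysis."""
    if view.get("has_ctes"):
        return "refresh-mark"
    if view.get("table_appears_twice"):
        return "refresh-mark"
    if view.get("table_below_top_level"):
        return "refresh-mark"
    if view.get("table_right_of_left_join"):
        return "refresh-mark"
    if not view.get("table_has_pk"):
        return "refresh-mark"
    # GROUP BY absent or PK-prefixed, and the rewritten per-PK query
    # plans with this table first: safe to update directly.
    if view.get("group_by_prefixed_by_pk", True) and view.get("table_first_in_plan"):
        return "direct-update"
    return "refresh-mark"
```

The point of the ordering is that "mark as needing refresh" is the safe fallback; only the last, most constrained case earns the cheaper direct update.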
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov wrote: > I have written code for quick index tuple deletion from a relation by heap > tuple TID. The code relates to the "Retail IndexTuple deletion" enhancement of > btree index on the postgresql wiki [1]. I knew that somebody would eventually read that Wiki page. :-) > Currently, index relation cleaning is performed by vacuum, which scans the whole > index relation for dead entries sequentially, tuple-by-tuple. When the number of > dead entries is not large, this simplistic and safe method can be significantly > surpassed by retail deletions, which use index scans to search for dead entries. I assume that the lazy vacuum bulk delete thing is much faster for the situation where you have many dead tuples in the index. However, allowing a B-Tree index to accumulate so much bloat in the first place often has consequences that cannot be reversed with anything less than a REINDEX. It's true that "prevention is better than cure" for all types of bloat. However, this could perhaps be as much as 100x more important for B-Tree bloat, since we cannot place new tuples in any place that is convenient. It's very different to bloat in the heap, even at a high level. > The code requires holding DEAD tuple storage until the scan key is created. In > this version I add a 'target_index_deletion_factor' option. If it is more than 0, > heap_page_prune() uses ItemIdMarkDead() instead of the ItemIdSetDead() function > to set the DEAD flag while holding the tuple storage. Makes sense. > The next step is developing a background worker which will collect pairs (tid, > scankey) of DEAD tuples from the heap_page_prune() function. Makes sense.
> | n | t1, s  | t2, s  | speedup | > |---|--------|--------|---------| > | 0 | 0.3    | 0.4476 | 1748.4  | > | 1 | 0.6    | 0.5367 | 855.99  | > | 2 | 0.0004 | 0.9804 | 233.99  | > | 3 | 0.0048 | 1.6493 | 34.576  | > | 4 | 0.5600 | 2.4854 | 4.4382  | > | 5 | 3.3300 | 3.9555 | 1.2012  | > | 6 | 17.700 | 5.6000 | 0.3164  | > |---|--------|--------|---------| > in the table, t1 - measured execution time of lazy_vacuum_index() function > by Quick-Vacuum strategy; t2 - measured execution time of > lazy_vacuum_index() function by Lazy-Vacuum strategy; The speedup looks promising. However, the real benefit should be in query performance, especially when we have heavy contention. Very eager retail index tuple deletion could help a lot there. It already makes sense to make autovacuum extremely aggressive in this case, to the point when it's running almost constantly. A more targeted cleanup process that can run much faster could do the same job, but be much more eager, and so be much more effective at *preventing* bloating of the key space [1][2]. > Note, guaranteed allowable time of index scans (used for quick deletion) > will be achieved by storing equal-key index tuples in physical TID order [2] > with patch [3]. I now have greater motivation to develop that patch into a real project. I bet that my heap-tid-sort patch will allow you to refine your interface when there are many logical duplicates: You can create one explicit scan key, but have a list of heap TIDs that need to be killed within the range of matching index tuples. That could be a lot more efficient in the event of many non-HOT updates, where most index tuple values won't actually change. You can sort the list of heap TIDs that you want to kill once, and then "merge" it with the tuples at the leaf level as they are matched/killed. It should be safe to avoid rechecking anything other than the heap TID values.
[1] http://pgeoghegan.blogspot.com/2017/07/postgresql-index-bloat-microscope.html [2] https://www.postgresql.org/message-id/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1...@mail.gmail.com -- Peter Geoghegan
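The "sort the kill list once, then merge" idea can be sketched in a few lines of Python (illustration only; it assumes both the duplicates' heap TIDs and the kill list are sorted ascending, as they would be with the heap-tid-sort patch):

```python
def kill_within_duplicates(dup_tids, kill_tids):
    """dup_tids: sorted heap TIDs of the index tuples sharing one key;
    kill_tids: sorted heap TIDs to delete.  A single linear merge marks
    victims -- no per-tuple key recheck is needed, only TID compares."""
    survivors = []
    k = 0
    for tid in dup_tids:
        # Advance past kill-list entries that sort before this tuple.
        while k < len(kill_tids) and kill_tids[k] < tid:
            k += 1
        if k < len(kill_tids) and kill_tids[k] == tid:
            continue            # matched: this index tuple is killed
        survivors.append(tid)
    return survivors
```

Because both inputs are sorted, the whole group of duplicates is processed in one O(n + m) pass regardless of how many tuples are killed.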
Re: Column store in Greenplum
Please consider other DBs like ClickHouse as well :) I'm hoping PG one day performs 80% as well as these, so there's no need to use them in the first instance. -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Re: Pluggable storage
@Amit Re: Vacuum etc. Chrome V8 just released this blog post around concurrent marking, which may be of interest considering how cpu limited the browser is. Contains benchmark numbers etc in post as well. https://v8project.blogspot.com/2018/06/concurrent-marking.html "This post describes the garbage collection technique called concurrent marking. The optimization allows a JavaScript application to continue execution while the garbage collector scans the heap to find and mark live objects. Our benchmarks show that concurrent marking reduces the time spent marking on the main thread by 60%–70%" -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Re: Pluggable storage
On 2018-06-18 12:43:57 -0700, AJG wrote: > @Amit > > Re: Vacuum etc. > > Chrome V8 just released this blog post around concurrent marking, which may > be of interest considering how cpu limited the browser is. Contains > benchmark numbers etc in post as well. > > https://v8project.blogspot.com/2018/06/concurrent-marking.html > > "This post describes the garbage collection technique called concurrent > marking. The optimization allows a JavaScript application to continue > execution while the garbage collector scans the heap to find and mark live > objects. Our benchmarks show that concurrent marking reduces the time spent > marking on the main thread by 60%–70%" I don't see how in-memory GC techniques have much bearing on the discussion here? Greetings, Andres Freund
Re: Index Skip Scan
On 06/18/2018 11:25 AM, Jesper Pedersen wrote: Hi all, I would like to start a discussion on Index Skip Scan referred to as Loose Index Scan in the wiki [1]. awesome I wasn't planning on making this a patch submission for the July CommitFest due to the reasons mentioned above, but can do so if people think it is best. New large features are not appropriate for the July CF. September should be your goal. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Removing "Included attributes in B-tree indexes" section from docs
On 06/18/2018 01:31 PM, Andres Freund wrote: On 2018-06-18 13:21:43 -0400, Alvaro Herrera wrote: On 2018-Jun-17, Peter Geoghegan wrote: On Sat, Jun 16, 2018 at 8:51 PM, Alvaro Herrera wrote: I don't necessarily object to the proposed change, but I think you should generally wait a bit longer for others to react. What wait period do you think is appropriate in this case? One which includes at least half a working day in a different timezone. You asked mid-afternoon on a Friday in a timezone pretty far west. I think you could find people in their offices in Easter Island, but not many more. I think there's also a question of how much a patch is blocking you / others. A shorter question period is more understandable if it's step 3/40, rather than 1/1... Yeah, but this would still apply for the first in a series of advertised patches. I usually try to wait at least a couple of days. I'm sure every committer understands how you can feel the itch to push. I know I do. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: libpq compression
Konstantin Knizhnik writes: > On 06.06.2018 02:03, Thomas Munro wrote: >> On Wed, Jun 6, 2018 at 2:06 AM, Konstantin Knizhnik >> wrote: >>> Thank you for review. Updated version of the patch fixing all reported >>> problems is attached. >> Small problem on Windows[1]: >> >>C:\projects\postgresql\src\include\common/zpq_stream.h(17): error >> C2143: syntax error : missing ')' before '*' >> [C:\projects\postgresql\libpq.vcxproj] >> 2395 >> >> You used ssize_t in zpq_stream.h, but Windows doesn't have that type. >> We have our own typedef in win32_port.h. Perhaps zpq_stream.c should >> include postgres.h/postgres_fe.h (depending on FRONTEND) like the >> other .c files in src/common, before it includes zpq_stream.h? >> Instead of "c.h". >> >> [1] >> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.1106 >> > Thank you very much for reporting the problem. > I attached new patch with include of postgres_fe.h added to zpq_stream.c Hello! Due to being in a similar place, I'm offering some code review. I'm excited that you're looking at throughput on the network stack - it's not usually what we think of in database performance. Here are some first thoughts, which have some overlap with what others have said on this thread already: ### This build still doesn't pass Windows: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.2277 You can find more about what the bot is doing here: http://cfbot.cputube.org/ ### I have a few misgivings about pq_configure(), starting with the name. The *only* thing this function does is set up compression, so it's mis-named. (There's no point in making something generic unless it's needed - it's just confusing.) I also don't like that you've injected into the *startup* path - before authentication takes place. Fundamentally, authentication (if it happens) consists of exchanging some combination of short strings (e.g., usernames) and binary blobs (e.g., keys).
None of this will compress well, so I don't see it as worth performing this negotiation there - it can wait. It's also another message in every startup. I'd leave it to connection parameters, personally, but up to you. ### Documentation! You're going to need it. There needs to be enough around for other people to implement the protocol (or if you prefer, enough for us to debug the protocol as it exists). In conjunction with that, please add information on how to set up compressed vs. uncompressed connections - similarly to how we've documentation on setting up TLS connection (though presumably compressed connection documentation will be shorter). ### Using terminology from https://facebook.github.io/zstd/zstd_manual.html : Right now you use streaming (ZSTD_{compress,decompress}Stream()) as the basis for your API. I think this is probably a mismatch for what we're doing here - we're doing one-shot compression/decompression of packets, not sending video or something. I think our use case is better served by the non-streaming interface, or what they call the "Simple API" (ZSTD_{decompress,compress}()). Documentation suggests it may be worth it to keep an explicit context around and use that interface instead (i.e., ZSTD_{compressCCTx,decompressDCtx}()), but that's something you'll have to figure out. You may find making this change helpful for addressing the next issue. ### I don't like where you've put the entry points to the compression logic: it's a layering violation. A couple people have expressed similar reservations I think, so let me see if I can explain using `pqsecure_read()` as an example. 
In pseudocode, `pqsecure_read()` looks like this: if conn->is_tls: n = tls_read(conn, ptr, len) else: n = pqsecure_raw_read(conn, ptr, len) return n I want to see this extended by your code to something like: if conn->is_tls: n = tls_read(conn, ptr, len) else: n = pqsecure_raw_read(conn, ptr, len) if conn->is_compressed: n = decompress(ptr, n) return n In conjunction with the above change, this should also significantly reduce the size of the patch (I think). ### The compression flag has proven somewhat contentious, as you've already seen on this thread. I recommend removing it for now and putting it in a separate patch to be merged later, since it's separable. ### It's not worth flagging style violations in your code right now, but you should be aware that there are quite a few style and whitespace problems. In particular, please be sure that you're using hard tabs when appropriate, and that line lengths are fewer than 80 characters (unless they contain error messages), and that pointers are correctly declared (`void *arg`, not `void* arg`). ### Thanks, --Robbie
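That layering can be sketched in Python, using zlib's one-shot API as a stand-in for ZSTD_decompress() (the helper names are invented for illustration); the point is that compression sits strictly above the transport read:

```python
import zlib

def secure_read(raw_read, is_compressed):
    """Sketch of the layering suggested above: the transport read
    (tls_read() or pqsecure_raw_read() in libpq) happens first, and
    decompression is applied to its result, keeping the compression
    logic out of the transport layer entirely."""
    def read():
        data = raw_read()                   # transport-level read
        if is_compressed:
            data = zlib.decompress(data)    # one-shot, per-packet
        return data
    return read

packet = b"row data " * 50
# Compressed connection: the wire carries compressed bytes.
reader = secure_read(lambda: zlib.compress(packet), is_compressed=True)
# Plain connection: the same code path, decompression skipped.
plain_reader = secure_read(lambda: packet, is_compressed=False)
```

Note this sketch decompresses whole packets with a one-shot call, matching the "Simple API" suggestion above, rather than maintaining a streaming context across reads.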
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Mon, Jun 18, 2018 at 4:59 PM Peter Geoghegan wrote: > > Note, guaranteed allowable time of index scans (used for quick deletion) > > will be achieved by storing equal-key index tuples in physical TID order [2] > > with patch [3]. > > I now have greater motivation to develop that patch into a real project. > > I bet that my heap-tid-sort patch will allow you to refine your > interface when there are many logical duplicates: You can create one > explicit scan key, but have a list of heap TIDs that need to be killed > within the range of matching index tuples. That could be a lot more > efficient in the event of many non-HOT updates, where most index tuple > values won't actually change. You can sort the list of heap TIDs that > you want to kill once, and then "merge" it with the tuples at the leaf > level as they are matched/killed. It should be safe to avoid > rechecking anything other than the heap TID values. > > [1] > http://pgeoghegan.blogspot.com/2017/07/postgresql-index-bloat-microscope.html > [2] > https://www.postgresql.org/message-id/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1...@mail.gmail.com Actually, once btree tids are sorted, you can continue tree descent all the way to the exact leaf page that contains the tuple to be deleted. Thus, the single-tuple interface ends up being quite OK. Sure, you can optimize things a bit by scanning a range, but only if vacuum is able to group keys in order to produce the optimized calls, and I don't see that terribly likely. So, IMHO the current interface may be quite enough.
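A small Python sketch (illustrative only, with invented names) of why unique (key, TID) composite keys make the single-tuple interface cheap: once heap TIDs participate in comparisons, descent amounts to an exact-match binary search, with no sideways scanning over duplicates:

```python
from bisect import bisect_left

def find_index_tuple(entries, key, tid):
    """entries is sorted by (key, tid) -- the tie-break that including
    heap TIDs in comparisons provides.  Because the composite key is
    unique, a plain binary search lands on the exact tuple to delete."""
    pos = bisect_left(entries, (key, tid))
    if pos < len(entries) and entries[pos] == (key, tid):
        return pos
    return -1          # tuple already gone

# Duplicates of key 10, distinguished only by heap TID.
entries = [(10, 1), (10, 7), (10, 9), (20, 2)]
hit = find_index_tuple(entries, 10, 7)
miss = find_index_tuple(entries, 10, 8)
```

Without the TID tie-break, the search for key 10 could land on any of the three duplicates and would have to scan siblings to find TID 7.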
Re: WAL prefetch
On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund wrote: >> The posix_fadvise approach is not perfect, no doubt about that. But it >> works pretty well for bitmap heap scans, and it's about 13249x better >> (rough estimate) than the current solution (no prefetching). > > Sure, but investing in an architecture we know might not live long also > has it's cost. Especially if it's not that complicated to do better.

My guesses are:

- Using OS prefetching is a very small patch.
- Prefetching into shared buffers is a much bigger patch.
- It'll be five years before we have direct I/O.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WAL prefetch
On 2018-06-18 16:44:09 -0400, Robert Haas wrote: > On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund wrote: > >> The posix_fadvise approach is not perfect, no doubt about that. But it > >> works pretty well for bitmap heap scans, and it's about 13249x better > >> (rough estimate) than the current solution (no prefetching). > > > > Sure, but investing in an architecture we know might not live long also > > has it's cost. Especially if it's not that complicated to do better. > > My guesses are: > > - Using OS prefetching is a very small patch. > - Prefetching into shared buffers is a much bigger patch. Why? The majority of the work is standing up a bgworker that does prefetching (i.e. reads WAL, figures out reads not in s_b, does prefetch). Allowing a configurable number + some synchronization between them isn't that much more work. > - It'll be five years before we have direct I/O. I think we'll have lost a significant market share by then if that's the case. Deservedly so. Greetings, Andres Freund
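A minimal Python model of the bgworker loop Andres describes, with WAL records reduced to lists of block numbers and `prefetch` standing in for a posix_fadvise(POSIX_FADV_WILLNEED) call (hypothetical names, not the actual patch):

```python
def prefetch_from_wal(wal_records, shared_buffers, prefetch, already=None):
    """Walk WAL records and, for each referenced block that is neither in
    shared buffers nor already requested, issue a prefetch hint."""
    already = set() if already is None else already
    for record in wal_records:
        for blk in record:
            if blk not in shared_buffers and blk not in already:
                # real code: posix_fadvise(fd, ..., POSIX_FADV_WILLNEED)
                prefetch(blk)
                already.add(blk)
```

Standing up several such workers is then mostly a matter of partitioning the WAL stream and sharing the "already requested" state, which is the synchronization Andres alludes to.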
Re: [HACKERS] Statement-level rollback
On 2018-Jun-16, Robert Haas wrote: > I'm not sure that really solves the problem, because changing the GUC > in either direction causes the system to behave differently. I don't > see any particular reason to believe that changing the behavior from A > to B is any more or less likely to break applications than a change > from B to A. I suppose the other option is to just disallow a change during a running session completely. That is, whatever value it has when you connect is final. Maybe a better idea is to make it write-once: the application connects, establishes its desired behavior, and then it cannot be changed anymore. > I put this feature, which - in this interest of full disclosure - is > already implemented in Advanced Server and has been for many years, > more or less in the same category as a hypothetical GUC that changes > case-folding from lower case to upper case. That is, it's really nice > for compatibility, but you can't get around the fact that every bit of > software that is supposed to run on all PostgreSQL instances has to be > tested in all possible modes, because otherwise you might find that it > doesn't work in all of those modes, or doesn't work as expected. It > is a behavior-changing GUC par excellence. Thanks for bringing this up. While I agree that both your example and the feature being proposed are behavior-changing, I don't think the parallel is very good, because the level of effort in order to port from one behavior to the other is much higher with statement-scoped rollback than with case-folding. In the case-folding example, I don't think you need to audit/rewrite all your applications in order to make them work: most of the time just rename a few tables, or if not just add a few quotes (and you can probably do it programmatically.) With statement-scoped rollback what you face is a more thorough review ... probably adding a savepoint call every other line.
I'm not sure that for a large codebase this is even reasonable to start talking about. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Statement-level rollback
On Mon, Jun 18, 2018 at 4:51 PM, Alvaro Herrera wrote: > On 2018-Jun-16, Robert Haas wrote: >> I'm not sure that really solves the problem, because changing the GUC >> in either direction causes the system to behave differently. I don't >> see any particular reason to believe that changing the behavior from A >> to B is any more or less likely to break applications than a change >> from B to A. > > I suppose the other option is to just disallow a change during a running > session completely. That is, whatever value is has when you connect is > final. > > Maybe a better idea is to make write-once: the application connects, > establishes its desired behavior, and then it cannot be changed anymore. That sounds even worse. I think if we're going to have this behavior at all, it should be possible to change the setting. >> I put this feature, which - in this interest of full disclosure - is >> already implemented in Advanced Server and has been for many years, >> more or less in the same category as a hypothetical GUC that changes >> case-folding from lower case to upper case. That is, it's really nice >> for compatibility, but you can't get around the fact that every bit of >> software that is supposed to run on all PostgreSQL instances has to be >> tested in all possible modes, because otherwise you might find that it >> doesn't work in all of those modes, or doesn't work as expected. It >> is a behavior-changing GUC par excellence. > > Thanks for bringing this up. > > While I agree that both your example and the feature being proposed are > behavior-changing, I don't think the parallel is very good, because the > level of effort in order to port from one behavior to the other is much > higher with statement-scoped rollback than with case-folding. 
In the > case-folding example, I don't think you need to audit/rewrite all your > applications in order to make them work: most of the time just rename a > few tables, or if not just add a few quotes (and you can probably do it > programatically.) > > With statement-scope rollback what you face is a more thorough review .. > probably adding a savepoint call every other line. I'm not sure that > for a large codebase this is even reasonable to start talking about. Yeah. The real problem is what happens when two code bases collide. For example, suppose you have a large code base that is using this, and then you install some extensions that weren't tested with it enabled. Or, you install a tool like pgAdmin or pgpool or whatever that turns out not to understand the new mode, and stuff breaks. It's a distributed burden on the ecosystem. I don't think we can avoid that. It's just a matter of whether it is worth it or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Index Skip Scan
On Mon, Jun 18, 2018 at 11:20 PM Andrew Dunstan wrote: > On 06/18/2018 11:25 AM, Jesper Pedersen wrote: > > I wasn't planning on making this a patch submission for the July > > CommitFest due to the reasons mentioned above, but can do so if people > > thinks it is best. T > > New large features are not appropriate for the July CF. September should > be your goal. Assuming this, should it be possible to register the patch for the September CF now? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Mon, Jun 18, 2018 at 1:42 PM, Claudio Freire wrote: > Actually, once btree tids are sorted, you can continue tree descent > all the way to the exact leaf page that contains the tuple to be > deleted. > > Thus, the single-tuple interface ends up being quite OK. Sure, you can > optimize things a bit by scanning a range, but only if vacuum is able > to group keys in order to produce the optimized calls, and I don't see > that terribly likely. Andrey talked about a background worker that did processing + index tuple deletion when handed off work by a user backend after it performs HOT pruning of a heap page. I consider that idea to be a good place to go with the patch, because in practice the big problem is workloads that suffer from so-called "write amplification", where most index tuples are created despite being "logically unnecessary" (e.g. one index among several prevents an UPDATE being HOT-safe, making inserts into most of the indexes "logically unnecessary"). I think that it's likely that only descending the tree once in order to kill multiple duplicate index tuples in one shot will in fact be *very* important (unless perhaps you assume that that problem is solved by something else, such as zheap). The mechanism that Andrey describes is rather unlike VACUUM as we know it today, but that's the whole point. -- Peter Geoghegan
Invisible Indexes
This is a MySQL feature, where an index is not considered by the planner. Implementing it should be fairly straightforward, adding a new boolean to pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE would become a new unreserved keyword. The most obvious use case is to see what the planner does when the index is not visible, for example which other index(es) it might use. There are probably other cases where we might want an index to enforce a constraint but not to be used in query planning. So, do we want this feature? If we do I'll go ahead and prepare a patch. cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Invisible Indexes
On Mon, Jun 18, 2018 at 2:36 PM, Andrew Dunstan wrote: > > This is a MySQL feature, where an index is not considered by the planner. > Implementing it should be fairly straightforward, adding a new boolean to > pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE would > become a new unreserved keyword. > So, do we want this feature? If we do I'll go ahead and prepare a patch. I know that it's definitely a feature that I want. Haven't thought about the syntax, though. -- Peter Geoghegan
Re: Invisible Indexes
On 18 June 2018 at 16:36, Andrew Dunstan wrote: > > This is a MySQL feature, where an index is not considered by the planner. > Implementing it should be fairly straightforward, adding a new boolean to > pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE would > become a new unreserved keyword. > > The most obvious use case is to see what the planner does when the index is > not visible, for example which other index(es) it might use. There are > probably other cases where we might want an index to enforce a constraint > but not to be used in query planning. > > So, do we want this feature? If we do I'll go ahead and prepare a patch. > Should pg_index.indisvalid work for this? In that case you only need the syntax for it... -- Jaime Casanova www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Transform for pl/perl
I wrote: > The remaining unresolved issue in this thread is Ilmari's suggestion > that we should convert integers to Perl IV (or UV?) if they fit, rather > than always convert to NV as now.

Oh ... after re-reading the thread I realized there was one other point that we'd all forgotten about, namely the business about ~0 getting converted to -1 rather than what Perl interprets it as. Ilmari sent in a patch for that, against which I'd raised two complaints:

1. Possible compiler complaints about a constant-false comparison, on machines where type UV is 32 bits.

2. Need for secondary expected-output files, which'd be a pain to maintain.

I realized that point 1 could be dealt with just by not trying to be smart, but always using the convert-to-text code path. Given that it seems to be hard to produce a UV value in Perl, I doubt it is worth working any harder than that. Also, point 2 could be dealt with in this perhaps crude way:

    -- this might produce either 18446744073709551615 or 4294967295
    SELECT testUVToJsonb() IN ('18446744073709551615'::jsonb, '4294967295'::jsonb);

Pushed with those adjustments.

regards, tom lane
Re: Invisible Indexes
On 06/18/2018 05:46 PM, Jaime Casanova wrote: On 18 June 2018 at 16:36, Andrew Dunstan wrote: This is a MySQL feature, where an index is not considered by the planner. Implementing it should be fairly straightforward, adding a new boolean to pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE would become a new unreserved keyword. The most obvious use case is to see what the planner does when the index is not visible, for example which other index(es) it might use. There are probably other cases where we might want an index to enforce a constraint but not to be used in query planning. So, do we want this feature? If we do I'll go ahead and prepare a patch. should pg_index.indisvalid works for this? in that case you only need the syntax for it... I thought about that. But I think these are more or less orthogonal. I doubt it will involve lots of extra code, though. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Invisible Indexes
On 06/18/2018 05:44 PM, Peter Geoghegan wrote: On Mon, Jun 18, 2018 at 2:36 PM, Andrew Dunstan wrote: This is a MySQL feature, where an index is not considered by the planner. Implementing it should be fairly straightforward, adding a new boolean to pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE would become a new unreserved keyword. So, do we want this feature? If we do I'll go ahead and prepare a patch. I know that it's definitely a feature that I want. Well, that's encouraging ;-) Haven't thought about the syntax, though. I envisioned:

    CREATE INDEX [NOT VISIBLE] ...;
    ALTER INDEX ... [SET [NOT] VISIBLE] ...;

Let the bikeshedding begin :-) cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov wrote: > Patch '0001-retail-indextuple-deletion' introduce new function > amtargetdelete() in access method interface. Patch > '0002-quick-vacuum-strategy' implements this function for an alternative > strategy of lazy index vacuum, called 'Quick Vacuum'. My compiler shows the following warnings: /code/postgresql/root/build/../source/src/backend/access/nbtree/nbtree.c: In function ‘bttargetdelete’: /code/postgresql/root/build/../source/src/backend/access/nbtree/nbtree.c:1053:3: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation] if (needLock) ^~ /code/postgresql/root/build/../source/src/backend/access/nbtree/nbtree.c:1055:4: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’ npages = RelationGetNumberOfBlocks(irel); ^~ /code/postgresql/root/build/../source/src/backend/access/nbtree/nbtree.c:1074:3: warning: ‘blkno’ may be used uninitialized in this function [-Wmaybe-uninitialized] cleanup_block(info, stats, blkno); ^ I think that they're both harmless, though. -- Peter Geoghegan
Re: Invisible Indexes
Andrew Dunstan writes: > This is a MySQL feature, where an index is not considered by the > planner. Implementing it should be fairly straightforward, adding a new > boolean to pg_index, and options to CREATE INDEX and ALTER INDEX. I > guess VISIBLE would become a new unreserved keyword. > The most obvious use case is to see what the planner does when the index > is not visible, for example which other index(es) it might use. There > are probably other cases where we might want an index to enforce a > constraint but not to be used in query planning. Traditionally the way to do the former is

    begin;
    drop index unwanted;
    explain ;
    rollback;

Admittedly, this isn't great in a production environment, but neither would be disabling the index in the way you suggest. I think the actually desirable way to handle this sort of thing is through an "index advisor" sort of plugin, which can hide a given index from the planner without any globally visible side-effects. I'm not sure about the "enforce constraint only" argument --- that sounds like a made-up use-case to me. It's pretty hard to imagine a case where a unique index applies to a query and yet you don't want to use it. > So, do we want this feature? If we do I'll go ahead and prepare a patch. On the whole I'm not excited about it, at least not with this approach. Have you considered an extension or GUC with only local side effects? regards, tom lane
Re: Invisible Indexes
On 2018-06-18 17:50:44 -0400, Andrew Dunstan wrote: > > > On 06/18/2018 05:46 PM, Jaime Casanova wrote: > > On 18 June 2018 at 16:36, Andrew Dunstan > > wrote: > > > This is a MySQL feature, where an index is not considered by the planner. > > > Implementing it should be fairly straightforward, adding a new boolean to > > > pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE > > > would > > > become a new unreserved keyword. > > > > > > The most obvious use case is to see what the planner does when the index > > > is > > > not visible, for example which other index(es) it might use. There are > > > probably other cases where we might want an index to enforce a constraint > > > but not to be used in query planning. > > > > > > So, do we want this feature? If we do I'll go ahead and prepare a patch. > > > > > should pg_index.indisvalid works for this? in that case you only need > > the syntax for it... > > > > > I thought about that. But I think these are more or less orthogonal. I > doubt it will involve lots of extra code, though. Be careful about that - currently it's not actually trivially possible to ever update pg_index rows. No, I'm not kidding you. pg_index.indexcheckxmin relies on the pg_index row's xmin. If you have ALTER do a non inplace update, you'll break things. Greetings, Andres Freund
Re: Invisible Indexes
Hi, On 2018-06-18 17:57:04 -0400, Tom Lane wrote: > Andrew Dunstan writes: > > This is a MySQL feature, where an index is not considered by the > > planner. Implementing it should be fairly straightforward, adding a new > > boolean to pg_index, and options to CREATE INDEX and ALTER INDEX. I > > guess VISIBLE would become a new unreserved keyword. > > > The most obvious use case is to see what the planner does when the index > > is not visible, for example which other index(es) it might use. There > > are probably other cases where we might want an index to enforce a > > constraint but not to be used in query planning. > > Traditionally the way to do the former is > > begin; > drop index unwanted; > explain ; > rollback; > > Admittedly, this isn't great in a production environment, but neither > would be disabling the index in the way you suggest. Yea, I don't think a global action - which'll at least take a something like a share-update-exclusive lock - is a suitable approach for this kinda thing. > I think the actually desirable way to handle this sort of thing is through > an "index advisor" sort of plugin, which can hide a given index from the > planner without any globally visible side-effects. Although I'm a bit doubtful that just shoving this into an extension is really sufficient. This is an extremely common task. Greetings, Andres Freund
Re: Invisible Indexes
On Mon, Jun 18, 2018 at 2:57 PM, Tom Lane wrote: > Admittedly, this isn't great in a production environment, but neither > would be disabling the index in the way you suggest. > > I think the actually desirable way to handle this sort of thing is through > an "index advisor" sort of plugin, which can hide a given index from the > planner without any globally visible side-effects. The globally visible side-effects are the point, though. Some users desire cheap insurance against dropping what turns out to be the wrong index. FWIW, this isn't just a MySQL feature. Oracle has a similar feature. -- Peter Geoghegan
Re: Invisible Indexes
Andres Freund writes: > On 2018-06-18 17:57:04 -0400, Tom Lane wrote: >> I think the actually desirable way to handle this sort of thing is through >> an "index advisor" sort of plugin, which can hide a given index from the >> planner without any globally visible side-effects. > Although I'm a bit doubtful that just shoving this into an extension is > really sufficient. This is an extremely common task. Well, what I was thinking about was that this functionality already exists (I think) in one or more "index advisor" plugins. It's possible that they've all bit-rotted for lack of support, which would not speak highly of the demand for the feature. But if we feel this is worth pulling into core, I think something along the lines of a GUC listing indexes to ignore for planning purposes might be a better design. It'd certainly dodge the issues you mentioned about lack of mutability of pg_index entries. regards, tom lane
Re: Invisible Indexes
On 2018-06-18 18:05:11 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-06-18 17:57:04 -0400, Tom Lane wrote: > >> I think the actually desirable way to handle this sort of thing is through > >> an "index advisor" sort of plugin, which can hide a given index from the > >> planner without any globally visible side-effects. > > > Although I'm a bit doubtful that just shoving this into an extension is > > really sufficient. This is an extremely common task. > > Well, what I was thinking about was that this functionality already > exists (I think) in one or more "index advisor" plugins. They're doing the opposite, right? I.e. they return "hypothetical indexes", which then can be used by the planner. None of the ones I've seen currently mask out an existing index. > It's possible that they've all bit-rotted for lack of support, which > would not speak highly of the demand for the feature. IDK, the DBA / developer crowd hitting issues like this isn't the same as the crowd willing to update an external plugin that doesn't even do quite what you want, and was more experimental than anything. Greetings, Andres Freund
Re: Invisible Indexes
Peter Geoghegan writes: > On Mon, Jun 18, 2018 at 2:57 PM, Tom Lane wrote: >> I think the actually desirable way to handle this sort of thing is through >> an "index advisor" sort of plugin, which can hide a given index from the >> planner without any globally visible side-effects. > The globally visible side-effects are the point, though. Some users > desire cheap insurance against dropping what turns out to be the wrong > index. Perhaps there are use-cases where you want globally visible effects, but the primary use-case Andrew cited (i.e. EXPLAIN experimentation) would not want that. Anyway, if we do it with a GUC, the user can control the scope of the effects. regards, tom lane
Re: Invisible Indexes
On Tue, Jun 19, 2018 at 12:05 AM, Tom Lane wrote: > > Well, what I was thinking about was that this functionality already > exists (I think) in one or more "index advisor" plugins. It's possible > that they've all bit-rotted for lack of support, which would not speak > highly of the demand for the feature. But if we feel this is worth > pulling into core, I think something along the lines of a GUC listing > indexes to ignore for planning purposes might be a better design. > It'd certainly dodge the issues you mentioned about lack of mutability > of pg_index entries. I know only one extension which does exactly that: https://github.com/postgrespro/plantuner It seems that it's still maintained.
Re: Invisible Indexes
Andres Freund writes: > On 2018-06-18 18:05:11 -0400, Tom Lane wrote: >> Well, what I was thinking about was that this functionality already >> exists (I think) in one or more "index advisor" plugins. > They're doing the opposite, right? I.e. they return "hypothetical > indexes", which then can be used by the planner. None of the ones I've > seen currently mask out an existing index. I had the idea that some of them could also hide existing indexes. It's been awhile, so maybe my memory is faulty, but the hook we provide is capable of that:

    /*
     * Allow a plugin to editorialize on the info we obtained from the
     * catalogs.  Actions might include altering the assumed relation size,
     * removing an index, or adding a hypothetical index to the indexlist.
     */
    if (get_relation_info_hook)
        (*get_relation_info_hook) (root, relationObjectId, inhparent, rel);

regards, tom lane
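What "removing an index" through that hook amounts to can be modeled in a few lines of Python (hypothetical names; the real hook would edit RelOptInfo->indexlist in C):

```python
class RelOptInfo:
    """Toy stand-in for the planner's per-relation info."""
    def __init__(self, indexlist):
        self.indexlist = list(indexlist)

def hide_indexes_hook(rel, hidden):
    """Model of a get_relation_info_hook that drops 'invisible' indexes
    from the list the planner will consider.  The side effect is local
    to this planning cycle; the catalogs are untouched."""
    rel.indexlist = [idx for idx in rel.indexlist if idx not in hidden]

# hide one index for this planning cycle only
rel = RelOptInfo(["t_pkey", "t_big_idx", "t_other_idx"])
hide_indexes_hook(rel, {"t_big_idx"})
```

This is exactly the "no globally visible side-effects" property Tom describes: the next planning cycle starts from the full catalog-derived index list again.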
Re: Invisible Indexes
On Mon, Jun 18, 2018 at 3:12 PM, Tom Lane wrote: > Perhaps there are use-cases where you want globally visible effects, > but the primary use-case Andrew cited (i.e. EXPLAIN experimentation) > would not want that. > > Anyway, if we do it with a GUC, the user can control the scope of > the effects. I had imagined that those use cases would be the most common. Dropping an index in production because it very much looks like it is unused is always a bit nerve-wracking in my experience. It's often hard to be 100% sure due to factors like replicas, the possible loss of statistic collector stats masking a problem, the possibility that there are very important queries that do use the index but are only run very infrequently, and so on. -- Peter Geoghegan
Re: Invisible Indexes
On 06/18/2018 06:12 PM, Tom Lane wrote: Peter Geoghegan writes: On Mon, Jun 18, 2018 at 2:57 PM, Tom Lane wrote: I think the actually desirable way to handle this sort of thing is through an "index advisor" sort of plugin, which can hide a given index from the planner without any globally visible side-effects. The globally visible side-effects are the point, though. Some users desire cheap insurance against dropping what turns out to be the wrong index. Perhaps there are use-cases where you want globally visible effects, but the primary use-case Andrew cited (i.e. EXPLAIN experimentation) would not want that. Anyway, if we do it with a GUC, the user can control the scope of the effects. Yeah, but Peter makes the case that people want it for global experimentation. "We think we can safely drop this humungous index that would take us days to rebuild, but before we do let's make it invisible and run for a few days just to make sure." I guess we could do that with a GUC, but it seems ugly. To Andres' point about the fragility of pg_index, maybe we'd need a separate catalog (pg_invisible_index)? cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Invisible Indexes
Andrew Dunstan writes: > On 06/18/2018 06:12 PM, Tom Lane wrote: >> Anyway, if we do it with a GUC, the user can control the scope of >> the effects. > Yeah, but Peter makes the case that people want it for global > experimentation. "We think we can safely drop this humungous index that > would take us days to rebuild, but before we do let's make it invisible > and run for a few days just to make sure." I guess we could do that with > a GUC, but it seems ugly. I find it hard to believe that it's uglier than what you suggested... and it also does more, and is easier to implement. regards, tom lane
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On Mon, Jun 18, 2018 at 12:59 PM, Alvaro Herrera wrote: > On 2018-Jun-18, Ashutosh Bapat wrote: > >> That's a wrong comparison. Inheritance based partitioning works even >> after declarative partitioning feature is added. So, users can >> continue using inheritance based partitioning if they don't want to >> move to declarative partitioning. That's not true here. A user creates >> a BEFORE ROW trigger on each partition that mimics partitioned table >> level BEFORE ROW trigger. The proposed BEFORE ROW trigger on >> partitioned table will cascade the trigger down to each partition. If >> that fails to recognize that there is already an equivalent trigger, a >> partition table will get two triggers doing the same thing silently >> without any warning or notice. That's fine if the trigger changes the >> salaries to $50K but not OK if each of those increases salary by 10%. >> The database has limited ability to recognize what a trigger is doing. > > I agree with Robert that nobody in their right minds would be caught by > this problem by adding new triggers without thinking about it and > without testing them. If you add a partitioned-table-level trigger to > raise salaries by 10% but there was already one in the partition level > that does the same thing, you'll readily notice that salaries have been > increased by 21% instead. > > So like Robert I'm inclined to not change the wording in the > documentation. > > What does worry me a little bit now, reading this discussion, is whether > we've made the triggers in partitions visible enough. We'll have this > problem once we implement BEFORE ROW triggers as proposed, and I think > we already have this problem now. 
> Let's suppose a user creates a duplicated after trigger:
>
> create table parent (a int) partition by range (a);
> create table child partition of parent for values from (0) to (100);
> create function noise() returns trigger language plpgsql as $$ begin raise
> notice 'nyaa'; return null; end; $$;
> create trigger trig_p after insert on parent for each row execute procedure
> noise();
> create trigger trig_c after insert on child for each row execute procedure
> noise();
>
> Now let's try it:
>
> alvherre=# insert into child values (1);
> NOTICE: nyaa
> NOTICE: nyaa
> INSERT 0 1
>
> OK, so where does that one come from?
>
> alvherre=# \d child
>             Tabla «public.child»
>  Columna │  Tipo   │ Collation │ Nullable │ Default
> ─────────┼─────────┼───────────┼──────────┼─────────
>  a       │ integer │           │          │
> Partition of: parent FOR VALUES FROM (0) TO (100)
> Triggers:
>     trig_c AFTER INSERT ON child FOR EACH ROW EXECUTE PROCEDURE noise()
>
> Hmm, there's only one trigger here, why does it appear twice? To find
> out, you have to know where to look:
>
> alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger;
>  tgname │ tgrelid │ tgisinternal
> ────────┼─────────┼──────────────
>  trig_p │ parent  │ f
>  trig_p │ child   │ t
>  trig_c │ child   │ f
> (3 filas)
>
> So there is a trigger in table child, but it's hidden because
> tgisinternal. Of course, you can see it if you look at the parent's
> definition:
>
> alvherre=# \d parent
>             Tabla «public.parent»
>  Columna │  Tipo   │ Collation │ Nullable │ Default
> ─────────┼─────────┼───────────┼──────────┼─────────
>  a       │ integer │           │          │
> Partition key: RANGE (a)
> Triggers:
>     trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise()
> Number of partitions: 1 (Use \d+ to list them.)
>
> I think it'd be useful to have a list of triggers that have been
> inherited from ancestors, or maybe simply a list of internal triggers
>
> Or maybe this is not something to worry about?
One of the things I was thinking about while reading this thread is that the scenario of creating "duplicate" triggers on a table (meaning two triggers doing the same thing) is entirely possible now, but we don't really do anything to prevent it, which is OK. I've never seen much merit in the argument "people should test" (it assumes a certain infallibility that just isn't true), but we've also generally been pretty good about exposing what is going on so people can debug this type of unexpected behavior. So +1 for thinking we do need to worry about it. I'm not exactly sure how we want to expose that info; with \d+ we list various "Partition X:" sections, so perhaps adding one for "Partition triggers:" would be enough, although I am inclined to think it needs exposure at the \d level. One other thing to consider is the firing order of said triggers... if all parent-level triggers fire before child-level triggers then the above separation is straightforward, but if the execution order is, as I suspect, mixed, then it becomes more complicated. Robert Treat http://xzilla.net
Re: Invisible Indexes
On Mon, Jun 18, 2018 at 3:05 PM, Tom Lane wrote: > But if we feel this is worth > pulling into core, I think something along the lines of a GUC listing > indexes to ignore for planning purposes might be a better design. > It'd certainly dodge the issues you mentioned about lack of mutability > of pg_index entries. While adding a mutable column to pg_index is probably ideal, an alternative is a pg_index_visible table related one-to-one (optionally) with pg_index, which would sidestep the mutability problem. MySQL has, and we would probably want, a GUC to control whether to check the table for visibility. Reading the MySQL description, the one use case posited is a DBA wanting to remove an index and see which queries then appear in their duration-limit log (probably in combination with auto_explain). An SQL interface to the feature seems desirable. On that front, VISIBLE and INVISIBLE are the pre-existing keywords in MySQL. As long as BEGIN; ALTER INDEX; ROLLBACK works as expected I wouldn't see any need for a GUC accepting text inputs. That said, somehow making "ALTER INDEX LOCAL name INVISIBLE" work, with it auto-reverting to visible at transaction end, would provide the one major advantage of an in-session SET. David J.
Re: Invisible Indexes
On Mon, Jun 18, 2018 at 6:11 PM, Andres Freund wrote: > On 2018-06-18 18:05:11 -0400, Tom Lane wrote: >> Andres Freund writes: >> > On 2018-06-18 17:57:04 -0400, Tom Lane wrote: >> >> I think the actually desirable way to handle this sort of thing is through >> >> an "index advisor" sort of plugin, which can hide a given index from the >> >> planner without any globally visible side-effects. >> >> > Although I'm a bit doubtful that just shoving this into an extension is >> > really sufficient. This is an extremely common task. >> >> Well, what I was thinking about was that this functionality already >> exists (I think) in one or more "index advisor" plugins. > > They're doing the opposite, right? I.e. they return "hypothetical > indexes", which then can be used by the planner. None of the ones I've > seen currently mask out an existing index. > > >> It's possible that they've all bit-rotted for lack of support, which >> would not speak highly of the demand for the feature. > > IDK, the DBA / developer crowd hitting issues like this isn't the same > as the crowd willing to update an external plugin that doesn't even do > quite what you want, and was more experimental than anything. > Indeed. ISTR a conversation I had with someone on slack earlier this year about the validity of just manually updating indisvalid as a means for determining if an index could be safely removed (for the record, I did not recommend it ;-)). DBAs are often willing to weedwhacker at things in SQL when the alternative is to learn C. Robert Treat http://xzilla.net
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Mon, Jun 18, 2018 at 2:54 PM, Peter Geoghegan wrote: > On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov > wrote: >> Patch '0001-retail-indextuple-deletion' introduce new function >> amtargetdelete() in access method interface. Patch >> '0002-quick-vacuum-strategy' implements this function for an alternative >> strategy of lazy index vacuum, called 'Quick Vacuum'. > > My compiler shows the following warnings: Some real feedback: What we probably want to end up with here is new lazyvacuum.c code that does processing for one heap page (and associated indexes) that is really just a "partial" lazy vacuum. Though it won't do things like advance relfrozenxid, it will do pruning for the heap page, index tuple killing, and finally heap tuple killing. It will do all of these things reliably, just like traditional lazy vacuum. This will be what your background worker eventually uses. I doubt that you can use routines like index_beginscan() within bttargetdelete() at all. I think that you need something closer to _bt_doinsert() or _bt_pagedel(), that manages its own scan (your code should probably live in nbtpage.c). It does not make sense to teach external, generic routines like index_beginscan() about heap TID being an implicit final index attribute, which is one reason for this (I'm assuming that this patch relies on my own patch). Another reason is that you need to hold an exclusive buffer lock at the point that you identify the tuple to be killed, until the point that you actually kill it. You don't do that now. IOW, the approach you've taken in bttargetdelete() isn't quite correct because you imagine that it's okay to occasionally "lose" the index tuple that you originally found, and just move on. That needs to be 100% reliable, or else we'll end up with index tuples that point to the wrong heap tuples in rare cases with concurrent insertions. As I said, we want a "partial" lazy vacuum here, which must mean that it's reliable. 
Note that _bt_pagedel() actually calls _bt_search() when it deletes a page. Your patch will not be the first patch that makes nbtree vacuuming do an index scan. You should be managing your own insertion scan key, much like _bt_pagedel() does. If you use my patch, _bt_search() can be taught to look for a specific heap TID.

Finally, doing things this way would let you delete multiple duplicates in one shot, as I described in an earlier e-mail. Only a single descent of the tree is needed to delete quite a few index tuples, provided that they all happen to be logical duplicates. Again, your background worker will take advantage of this.

This code does not follow the Postgres style:

> -		else
> +	}
> +	else {
> +		if (rootoffnum != latestdead)
> +			heap_prune_record_unused(prstate, latestdead);
> 		heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);
> +	}
> }

Please be more careful about that. I find it very distracting.

-- Peter Geoghegan
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Mon, Jun 18, 2018 at 4:05 PM, Peter Geoghegan wrote: > Finally, doing things this way would let you delete multiple > duplicates in one shot, as I described in an earlier e-mail. Only a > single descent of the tree is needed to delete quite a few index > tuples, provided that they all happen to be logical duplicates. Again, > your background worker will take advantage of this. BTW, when you do this you should make sure that there is only one call to _bt_delitems_vacuum(), so that there aren't too many WAL records. Actually, that's not quite correct -- there should be one _bt_delitems_vacuum() call *per leaf page* per bttargetdelete() call, which is slightly different. There should rarely be more than one or two calls to _bt_delitems_vacuum() in total, because your background worker is only going to delete one heap page's duplicates per bttargetdelete() call, and because there will be locality/correlation with my TID order patch. -- Peter Geoghegan
Re: Invisible Indexes
On Mon, Jun 18, 2018 at 3:17 PM, Andrew Dunstan < andrew.duns...@2ndquadrant.com> wrote: > Yeah, but Peter makes the case that people want it for global > experimentation. "We think we can safely drop this humungous index that > would take us days to rebuild, but before we do let's make it invisible and > run for a few days just to make sure." I guess we could do that with a GUC, > but it seems ugly. > On that front what's the proposed behavior for cached plans using said index? IIUC with a GUC you'd have to force clients to establish new sessions if you wanted all queries to be affected by the new setting whereas using cache invalidation you can affect existing sessions with a catalog update. David J.
Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development
On Mon, Jun 18, 2018 at 10:57:34AM +0900, Michael Paquier wrote: > As there is visibly a consensus for HEAD, for now I propose to just > proceed with the previous patch on only the master branch then. This > will address the open item. Any objections to that? As there is a consensus at least on this part, I have just pushed the patch to only HEAD. I'll notify buildfarm members which see failures in case those link to OpenSSL 1.0.1 or older. -- Michael signature.asc Description: PGP signature
Re: Supporting tls-server-end-point as SCRAM channel binding for OpenSSL 1.0.0 and 1.0.1
On Sat, Jun 09, 2018 at 09:28:17AM +0900, Michael Paquier wrote: > I am still not completely sure what is the correct course of action > here. Heikki and Peter are not much in favor of adding more complexity > here, as OpenSSL has a long history of being non-linear across > platforms. On the other side, PGDG provides packages down to RHEL6, and > there are surely servers which use it as backend. As Peter and Heikki have worked with me on all those features as well, are there any objections to discarding this open item? I looked again at the patch this morning, and it is true that OpenSSL's history makes things harder, so keeping the code consistent and simple with their last LTS version is appealing. -- Michael signature.asc Description: PGP signature
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
Hi. On 2018/06/19 1:59, Alvaro Herrera wrote: > What does worry me a little bit now, reading this discussion, is whether > we've made the triggers in partitions visible enough. We'll have this > problem once we implement BEFORE ROW triggers as proposed, and I think > we already have this problem now. Let's suppose a user creates a > duplicated after trigger: > > create table parent (a int) partition by range (a); > create table child partition of parent for values from (0) to (100); > create function noise() returns trigger language plpgsql as $$ begin raise > notice 'nyaa'; return null; end; $$; > create trigger trig_p after insert on parent for each row execute procedure > noise(); > create trigger trig_c after insert on child for each row execute procedure > noise(); > > Now let's try it: > > alvherre=# insert into child values (1); > NOTICE: nyaa > NOTICE: nyaa > INSERT 0 1 > > OK, so where does that one come from? > > alvherre=# \d child > Tabla «public.child» > Columna │ Tipo │ Collation │ Nullable │ Default > ─┼─┼───┼──┼─ > a │ integer │ │ │ > Partition of: parent FOR VALUES FROM (0) TO (100) > Triggers: > trig_c AFTER INSERT ON child FOR EACH ROW EXECUTE PROCEDURE noise() > > Hmm, there's only one trigger here, why does it appear twice? To find > out, you have to know where to look: > > alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger; > tgname │ tgrelid │ tgisinternal > ┼─┼── > trig_p │ parent │ f > trig_p │ child │ t > trig_c │ child │ f > (3 filas) > > So there is a trigger in table child, but it's hidden because > tgisinternal. Of course, you can see it if you look at the parent's > definition: > > alvherre=# \d parent >Tabla «public.parent» > Columna │ Tipo │ Collation │ Nullable │ Default > ─┼─┼───┼──┼─ > a │ integer │ │ │ > Partition key: RANGE (a) > Triggers: > trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise() > Number of partitions: 1 (Use \d+ to list them.) 
> I think it'd be useful to have a list of triggers that have been
> inherited from ancestors, or maybe simply a list of internal triggers

In CreateTrigger(), 86f575948c7 did this:

-	values[Anum_pg_trigger_tgisinternal - 1] = BoolGetDatum(isInternal);
+	values[Anum_pg_trigger_tgisinternal - 1] = BoolGetDatum(isInternal || in_partition);

I'm not sure why it had to be done, but undoing this change doesn't seem to break any regression tests (neither those added by 86f575948c7 nor those of the partitioned-table foreign key commit). Did we really intend to change the meaning of tgisinternal like this here?

Thanks, Amit
Re: Index Skip Scan
On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote: > Assuming this, should we have possibility to register patch to > September CF from now? There cannot be two commit fests marked as open at the same time, as Magnus mentions here: https://www.postgresql.org/message-id/cabuevex1k+axzcv2t3weyf5ulg72ybksch_huhfnzq+-kso...@mail.gmail.com Admins can register new patches in future commit fests, but normal users cannot, right? In this case, could you wait until the next CF is marked as in progress and the September one is opened? You could also add it to the July one if you are not willing to wait, and it will get bumped by one of the CFMs, but that makes the whole process unnecessarily noisy. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Find additional connection service files in pg_service.conf.d directory
On Mon, Jun 18, 2018 at 07:17:14PM +0300, Arthur Zakirov wrote: > Isn't it worth to check errno for stat(), opendir() and readdir() > calls? Checking for such errors is mandatory. There are some cases where there are filters based on errno like ENOENT, but if something unexpected is happening the user should know about it. Of course this depends on the feature discussed and its properties.. -- Michael signature.asc Description: PGP signature
Re: Partitioning with temp tables is broken
Hello. On 2018/06/18 15:02, Michael Paquier wrote: > On Mon, Jun 18, 2018 at 01:27:51PM +0900, Amit Langote wrote: >> On 2018/06/17 22:11, Michael Paquier wrote: >> Which checks do you think are missing other than those added by the >> proposed patch? > > I was just wondering how this reacted if trying to attach a foreign > table to a partition tree which is made of temporary tables, and things > get checked in MergeAttributes, and you have added a check for that: > +-- do not allow a foreign table to become a partition of temporary > +-- partitioned table > +create temp table temp_parted (a int) partition by list (a); > > Those tests should be upper-case I think to keep consistency with the > surrounding code. Ah, sorry about that. As you may have seen in the changed code, the guard in MergeAttributes really just checks relpersistence, so the check prevents foreign tables from being added as a partition of a temporary parent table. Not sure how much sense it makes to consider a foreign table's relpersistence to be RELPERSISTENCE_PERMANENT, but that's a different matter I guess. One cannot create temporary foreign tables because of the lack of syntax for it, so there's no need to worry about that. >>> I am quickly looking at forbid-temp-parts-1.patch from previous message >>> https://postgr.es/m/a6bab73c-c5a8-2c25-f858-5d6d800a8...@lab.ntt.co.jp >>> and this shines per its lack of tests. It would be easy enough to test >>> that temp and permanent relations are not mixed within the same session >>> for multiple levels of partitioning. Amit, could you add some? There >>> may be tweaks needed for foreign tables or such, but I have not looked >>> close enough at the problem yet.. >> >> OK, I have added some tests. Thanks for looking! > > Okay, I have looked at this patch and tested it manually and I can > confirm the following restrictions: > > - Partition trees are supported for a set of temporary relations within > the same session.
> - If trying to work with temporary relations from different sessions, > then failure. > - If trying to attach a temporary partition to a permanent parent, then > failure. > - If trying to attach a permanent relation to a temporary parent, then > failure. > > + /* If the parent is permanent, so must be all of its partitions. */ > + if (is_partition && > + relation->rd_rel->relpersistence != RELPERSISTENCE_TEMP && > + relpersistence == RELPERSISTENCE_TEMP) > + ereport(ERROR, > + (errcode(ERRCODE_WRONG_OBJECT_TYPE), > +errmsg("cannot create a temporary relation as partition of > permanent relation \"%s\"", > +RelationGetRelationName(relation)))); > > I had some second thoughts about this one actually because inheritance > trees allow a temporary child for a permanent parent: > > =# create table aa_parent (a int); > CREATE TABLE > =# create temp table aa_temp_child (a int) inherits (aa_parent); > NOTICE: 0: merging column "a" with inherited definition > LOCATION: MergeAttributes, tablecmds.c:2306 > CREATE TABLE > > And they also disallow a permanent child to inherit from a temporary > parent: > =# create temp table aa_temp_parent (a int); > CREATE TABLE > =# create table aa_child (a int) inherits (aa_temp_parent); > ERROR: 42809: > cannot inherit from temporary relation "aa_temp_parent" > > I am not seeing any regression tests for that actually so that would be > a nice addition, with also a test for inheritance of only temporary > relations. That's not something for the patch discussed on this thread > to tackle. > > Still the partition bound checks make that kind of harder to bypass, and > the use-case is not obvious, so let's keep the restriction as > suggested... Yeah, unlike regular inheritance, we access partitions from PartitionDesc without checking relpersistence in some of the newly added code in PG 11 and also in PG 10 (the latter doesn't crash but gives an unintuitive error as you said upthread).
If a use case to mix temporary and permanent tables in partition tree pops up in the future, we can revisit it and implement it correctly. > - /* Ditto for the partition */ > + /* Ditto for the partition. */ > > Some noise here. Oops, fixed. > Adding a test which makes sure that partition trees made of only > temporary relations work would be nice. I added a test to partition_prune.sql. > Documenting all those restrictions and behaviors would be nice, why not > adding a paragraph in ddl.sgml, under the section for declarative > partitioning? OK, I've tried that in the attached updated patch, but I couldn't write beyond a couple of sentences that I've added in 5.10.2.3. Limitations. Thanks, Amit >From 8c067c66a4ccf25eee75f6454de7087e7228a259 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 15 Jun 2018 10:20:26 +0900 Subject: [PATCH v3] Disallow mixing partitions of differing relpersistence and visibility --- doc/src/sgml/ddl.sgml | 8 src/b
Adding tests for inheritance trees with temporary tables
Hi all, While look at the handling of temporary relations with partition trees, I have noticed that there are no tests for inheritance trees with temp tables. This has been mentioned here: https://www.postgresql.org/message-id/20180618060200.gg3...@paquier.xyz Attached is a patch to close the gap. That's of course not v11 material, so I am parking this patch into the next CF. Thanks, -- Michael diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out index b2b912ed5c..9b1312a0ca 100644 --- a/src/test/regress/expected/inherit.out +++ b/src/test/regress/expected/inherit.out @@ -1710,6 +1710,51 @@ select * from cnullparent where f1 = 2; drop table cnullparent cascade; NOTICE: drop cascades to table cnullchild -- +-- Check use of temprary tables with inheritance trees +-- +create table inh_perm_parent (a1 int); +create temp table inh_temp_parent (a1 int); +create temp table inh_temp_child (a1 int) inherits (inh_perm_parent); -- ok +NOTICE: merging column "a1" with inherited definition +create table inh_perm_child (a1 int) inherits (inh_temp_parent); -- error +ERROR: cannot inherit from temporary relation "inh_temp_parent" +create temp table inh_temp_child_2 (a1 int) inherits (inh_temp_parent); -- ok +NOTICE: merging column "a1" with inherited definition +insert into inh_perm_parent values (1); +insert into inh_temp_parent values (2); +insert into inh_temp_child values (3); +insert into inh_temp_child_2 values (4); +select * from inh_perm_parent; + a1 + + 1 + 3 +(2 rows) + +select * from inh_temp_parent; + a1 + + 2 + 4 +(2 rows) + +select * from inh_temp_child; + a1 + + 3 +(1 row) + +select * from inh_temp_child_2; + a1 + + 4 +(1 row) + +drop table inh_perm_parent cascade; +NOTICE: drop cascades to table inh_temp_child +drop table inh_temp_parent cascade; +NOTICE: drop cascades to table inh_temp_child_2 +-- -- Check that constraint exclusion works correctly with partitions using -- implicit constraints generated from the partition bound 
information. -- diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql index 5a48376fc0..2c3e35583a 100644 --- a/src/test/regress/sql/inherit.sql +++ b/src/test/regress/sql/inherit.sql @@ -635,6 +635,25 @@ select * from cnullparent; select * from cnullparent where f1 = 2; drop table cnullparent cascade; +-- +-- Check use of temprary tables with inheritance trees +-- +create table inh_perm_parent (a1 int); +create temp table inh_temp_parent (a1 int); +create temp table inh_temp_child (a1 int) inherits (inh_perm_parent); -- ok +create table inh_perm_child (a1 int) inherits (inh_temp_parent); -- error +create temp table inh_temp_child_2 (a1 int) inherits (inh_temp_parent); -- ok +insert into inh_perm_parent values (1); +insert into inh_temp_parent values (2); +insert into inh_temp_child values (3); +insert into inh_temp_child_2 values (4); +select * from inh_perm_parent; +select * from inh_temp_parent; +select * from inh_temp_child; +select * from inh_temp_child_2; +drop table inh_perm_parent cascade; +drop table inh_temp_parent cascade; + -- -- Check that constraint exclusion works correctly with partitions using -- implicit constraints generated from the partition bound information. signature.asc Description: PGP signature
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Mon, Jun 18, 2018 at 4:05 PM, Peter Geoghegan wrote: > IOW, the approach you've taken in bttargetdelete() isn't quite correct > because you imagine that it's okay to occasionally "lose" the index > tuple that you originally found, and just move on. That needs to be > 100% reliable, or else we'll end up with index tuples that point to > the wrong heap tuples in rare cases with concurrent insertions. Attached patch adds a new amcheck check within bt_index_parent_check(). With the patch, heap TIDs are accumulated in a tuplesort and sorted at the tail end of verification (before optional heapallindexed verification runs). This will reliably detect the kind of corruption I noticed was possible with your patch. Note that the amcheck enhancement that went along with my heap-tid-btree-sort patch may not have detected this issue, which is why I wrote this patch -- the heap-tid-btree-sort amcheck stuff could detect duplicates, but only when all other attributes happened to be identical when comparing sibling index tuples (i.e. only when we must actually compare TIDs across sibling index tuples). If you add this check, I'm pretty sure that you can detect any possible problem. You should think about using this to debug your patch. I may get around to submitting this to a CF, but that isn't a priority right now. -- Peter Geoghegan 0001-Detect-duplicate-heap-TIDs-using-a-tuplesort.patch Description: Binary data
Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development
On Tue, Jun 19, 2018 at 09:11:20AM +0900, Michael Paquier wrote: > On Mon, Jun 18, 2018 at 10:57:34AM +0900, Michael Paquier wrote: > > As there is visibly a consensus for HEAD, for now I propose to just > > proceed with the previous patch on only the master branch then. This > > will address the open item. Any objections to that? > > As there is a consensus at least on this part, I have just pushed the > patch to only HEAD. I'll notify buildfarm members which see failures in > case those link to OpenSSL 1.0.1 or older. Andrew, bowerbird is complaining with what looks like a library loading issue with initdb: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=bowerbird&dt=2018-06-19%2001%3A08%3A30&stg=check pg_regress: initdb failed Examine G:/prog/bf/root/HEAD/pgsql.build/src/test/regress/log/initdb.log for the reason. In my own environment I have a copy of ssleay32.dll and libeay32.dll in c:\Windows\System32 so that I don't get any complaints when running regression tests. Is your environment using something specific? -- Michael signature.asc Description: PGP signature
Re: Libpq support to connect to standby server as priority
On Wed, Jan 24, 2018 at 9:01 AM Jing Wang wrote: > Hi All, > > Recently I put a proposal to support 'prefer-read' parameter in > target_session_attrs in libpq. Now I updated the patch with adding content > in the sgml and regression test case. > > Some people may have noticed there is already another patch ( > https://commitfest.postgresql.org/15/1148/ ) which looks similar with > this. But I would say this patch is more complex than my proposal. > > It is better separate these 2 patches to consider. > I also feel prefer-read and read-only need to be treated as two different options; prefer-read is simpler to support than read-only. Here I attached an updated patch that is rebased to the latest master and also fixes some of the corner scenarios. Regards, Haribabu Kommi Fujitsu Australia 0001-Allow-target-session-attrs-to-accept-prefer-read-opti.patch Description: Binary data
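For context, under the proposal a client would request a standby by analogy with the existing read-write value, e.g. (note that prefer-read is the proposed, not-yet-committed value; only read-write and any exist in released libpq at the time of writing):

```
# existing: require a server that accepts writes
postgresql://host1,host2/mydb?target_session_attrs=read-write

# proposed: prefer a read-only standby, falling back to the
# primary if no read-only server is reachable
postgresql://host1,host2/mydb?target_session_attrs=prefer-read
```

The fallback behavior in the second form is what makes prefer-read simpler than a strict read-only mode, which would additionally have to define what happens when only a writable server is available.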
Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"
On Mon, Jun 18, 2018 at 3:09 PM, Amit Khandekar wrote: > On 16 June 2018 at 19:30, Amit Kapila wrote: >> On Sat, Jun 16, 2018 at 10:44 AM, Amit Kapila >> wrote: >>> Yeah, or perhaps disallow creation of any partial paths at the first >>> place like in attached. This will save us some work as well. >>> >> >> Attached patch contains test case as well. I have tried to come up >> with some simple test, but couldn't come up with anything much simpler >> than reported by Rajkumar, so decided to use the test case provided by >> him. >> > > Thanks for the patch. I had a look at it, and it looks good to me. One > minor comment : > > +-- Parallel Append is not be used when the subpath depends on the outer param > "is not be used" => "is not to be used" > Fixed in the attached patch. I will wait for a day or two to see if Tom or Robert wants to say something and then commit. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com fix_pa_path_generation_v3.patch Description: Binary data
Re: Pluggable storage
On Tue, Jun 19, 2018 at 1:13 AM, AJG wrote: > @Amit > > Re: Vacuum etc. > > Chrome V8 just released this blog post around concurrent marking, which may > be of interest considering how cpu limited the browser is. Contains > benchmark numbers etc in post as well. > > https://v8project.blogspot.com/2018/06/concurrent-marking.html > > "This post describes the garbage collection technique called concurrent > marking. The optimization allows a JavaScript application to continue > execution while the garbage collector scans the heap to find and mark live > objects. Our benchmarks show that concurrent marking reduces the time spent > marking on the main thread by 60%–70%" > Thanks for sharing the link, but I don't think it is directly related to the work we are doing. Feel free to discuss it on the zheap thread [1], or even start a new thread, because it appears to be a general technique for improving GC (garbage collection) rather than something directly related to zheap (or undo) technology. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BYtM5vxzSM2NZm%2BpC37MCwyvtkmJrO_yRBQeZDp9Wa2w%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development
On 06/18/2018 11:13 PM, Michael Paquier wrote:
> On Tue, Jun 19, 2018 at 09:11:20AM +0900, Michael Paquier wrote:
>> On Mon, Jun 18, 2018 at 10:57:34AM +0900, Michael Paquier wrote:
>>> As there is visibly a consensus for HEAD, for now I propose to just proceed with the previous patch on only the master branch then. This will address the open item. Any objections to that?
>> As there is a consensus at least on this part, I have just pushed the patch to only HEAD. I'll notify buildfarm members which see failures in case those link to OpenSSL 1.0.1 or older.
> Andrew, bowerbird is complaining with what looks like a library loading issue with initdb: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=bowerbird&dt=2018-06-19%2001%3A08%3A30&stg=check pg_regress: initdb failed Examine G:/prog/bf/root/HEAD/pgsql.build/src/test/regress/log/initdb.log for the reason. In my own environment I have a copy of ssleay32.dll and libeay32.dll in c:\Windows\System32 so that I don't get any complaints when running regression tests. Is your environment using something specific?

I have adjusted the PATH setting - it should be fixed on the next run.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services