Re: slight tweaks to documentation about runtime pruning
On Wed, 5 Dec 2018 at 20:24, Amit Langote wrote:
> Documentation of run-time pruning tells readers to inspect the "nloops"
> property of the EXPLAIN ANALYZE output, but I think that's a typo of
> "loops", which is what's actually output (the internal variable to track
> that property is indeed nloops).

I agree. The 'n' should be dropped there.

> However, for pruned partitions' subplans, what's actually shown is the
> string "(never executed)", not loops. So, wouldn't it be better to tell
> the readers to look for that instead of "loops"?

I don't really see any issues with the existing documentation here.
Remember that pruning can be performed multiple times when a parameter
that was found to match the partition key changes and the
Append/MergeAppend is rescanned. For example:

create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);
create index on listp(a);
set enable_bitmapscan=0;
explain analyze select * from (values(1),(1),(2)) a(a) inner join listp l on a.a = l.a;

                              QUERY PLAN
------------------------------------------------------------------------
 Nested Loop  (cost=0.15..91.50 rows=76 width=8) (actual time=0.013..0.013 rows=0 loops=1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4) (actual time=0.001..0.002 rows=3 loops=1)
   ->  Append  (cost=0.15..30.23 rows=26 width=4) (actual time=0.003..0.003 rows=0 loops=3)
         ->  Index Only Scan using listp1_a_idx on listp1 l  (cost=0.15..15.05 rows=13 width=4) (actual time=0.002..0.002 rows=0 loops=2)
               Index Cond: (a = "*VALUES*".column1)
               Heap Fetches: 0
         ->  Index Only Scan using listp2_a_idx on listp2 l_1  (cost=0.15..15.05 rows=13 width=4) (actual time=0.003..0.003 rows=0 loops=1)
               Index Cond: (a = "*VALUES*".column1)
               Heap Fetches: 0
 Planning Time: 0.158 ms
 Execution Time: 0.042 ms
(11 rows)

listp1 was scanned twice (loops=2); listp2 was scanned just once.

Now, it is true that a subplan executed 0 times will appear as
"(never executed)", but do we really need to explain in this area that
"(never executed)" means loops=0?

-- 
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: Should new partitions inherit their tablespace from their parent?
Thank you for looking at this.

On Fri, 7 Dec 2018 at 20:15, Michael Paquier wrote:
> -ATExecPartedIdxSetTableSpace(Relation rel, Oid newTableSpace)
> +ATExecSetTableSpaceNoStorage(Relation rel, Oid newTableSpace)
> NoStorage looks strange as a routine name for this case. Would something
> like ATExecPartedRelSetTableSpace be more adapted perhaps?

I'd say it's better to name the function in the more general-purpose
way. Perhaps we'll get some relkind in the future that's not a
partitioned table or index but that needs reltablespace set. If that
happens and we name the function the way you're proposing, then we
might end up with some other function that does the same thing, or the
patch adding the new relkind would have to rename the function to
something more like what I have. To save either of those from having
to happen, would it not be better to give it the generic name now?

Instead of renaming the function, I've altered the Assert() to allow
all relkinds which don't have storage, and also updated the comment to
remove the mention of partitioned tables and indexes.

> + else if (stmt->partbound)
> + {
> +     RangeVar   *parent;
> +     Relation    parentrel;
> +
> +     /*
> +      * For partitions, when no other tablespace is specified, we default
> +      * the tablespace to the parent partitioned table's.
> +      */
> Okay, so the direct parent is what counts, and not the top-most parent.
> Could you add a test with multiple levels of partitioned tables, like:
> - parent is in tablespace 1.
> - child is created in tablespace 2.
> - grandchild is created, which should inherit tablespace 2 from the
> child, but not tablespace 1 from the parent. In the existing example,
> as one partition is used to test the top-most parent and another for the
> new default, it looks cleaner to create a third partition which would be
> itself a partitioned table.

Sounds good. I've modified the existing test to do it that way. I don't
think there's a need to have multiple tests for that.

I've attached an updated patch.

-- 
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v2-0001-Allow-newly-created-partitions-to-inherit-their-p.patch
Description: Binary data
Re: slight tweaks to documentation about runtime pruning
On Sun, Dec 9, 2018 at 8:13 PM David Rowley wrote:
> listp1 was scanned twice (loops=2); listp2 was scanned just once.
>
> Now, it is true that a subplan executed 0 times will appear as
> "(never executed)", but do we really need to explain in this area that
> "(never executed)" means loops=0?

Ah, I see what you mean. So, "(never executed)" is not the only sign of
run-time pruning occurring. The value of loops can be different for
different subplans / partitions, and a lower value for a given subplan
means that the subplan was pruned away on more of the rescans.

Thanks,
Amit
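
For readers who want to see "(never executed)" itself, a minimal sketch
reusing the listp tables from David's example (the behavior is as
described in this thread; the exact plan text may vary by version):

    -- The subquery's value is only known at executor startup, so the
    -- non-matching partition is pruned at run time and its subplan is
    -- reported as "(never executed)" instead of with a loops count.
    explain analyze select * from listp where a = (select 2);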
Re: pg_partition_tree crashes for a non-defined relation
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Sat, Dec 08, 2018 at 08:46:08AM -0500, Stephen Frost wrote:
> > I wonder if we maybe should have a regression test for every such
> > function which just queries the catalog in a way to force the function
> > to be called for every relation defined in the regression tests, to
> > ensure that it doesn't segfault or throw an error.
>
> Like sqlsmith? It looks hard to me to make something like that part of
> the main regression test suite, as that's going to be costly and hard to
> scale with.

No, I mean something like:

with x as (select pg_partition_tree(relname) from pg_class)
select 1 from x limit 1;

or whatever it takes to make sure that the function is run against every
entry in pg_class (or whatever is appropriate) while not returning the
results (since we don't actually care about the output, we just want to
make sure it doesn't ERROR or crash).

sqlsmith, as I recall, doesn't care about ERROR cases; it's just looking
for crashes, so it's not quite the same thing.

Thanks!

Stephen
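
As a concrete sketch of that kind of sweep (assuming pg_partition_tree()
takes a regclass, so pg_class.oid can be passed directly; count(*) is
used instead of LIMIT so that every row really is evaluated):

    -- Run the function over every pg_class entry, discarding the output;
    -- the only thing being tested is that no ERROR or crash occurs.
    select count(*)
    from (select pg_partition_tree(oid) from pg_class) as sweep;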
Re: pg_partition_tree crashes for a non-defined relation
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Sun, Dec 09, 2018 at 08:15:07AM +0900, Michael Paquier wrote:
> > On Sat, Dec 08, 2018 at 08:46:08AM -0500, Stephen Frost wrote:
> >> We should really have a more clearly defined policy around this, but my
> >> recollection is that we often prefer to return NULL rather than throwing
> >> an error for the convenience of people doing things like querying
> >> pg_class using similar functions.
> >
> > Yes, that's visibly right. At least that's what I can see from the
> > various pg_get_*def and pg_*_is_visible. Returning NULL would indeed
> > be more consistent.
>
> Thinking more about your argument, scanning pg_class fully is quite
> sensible as well because there is no need to apply an extra qual on
> relkind, so let's change the function as you suggest, by returning NULL
> on an invalid relation type. Any opinions about the attached then, which
> does the switch?

Looks alright on a quick glance, but shouldn't you also update the
comment?

Thanks!

Stephen
Collatability of type "name"
I've been experimenting with the task proposed in [1] of expanding the
text_ops operator family to include type "name" as well as cross-type
text vs. name operators. These operators would need to offer
collation-aware sorting, since that's exactly the difference between
text_ops and the non-collation-aware name_ops opfamily.

I ran into a nasty stumbling block almost immediately: the proposed
name vs. name comparison operators fail, because the parser sees that
both inputs are of noncollatable types, so it doesn't assign any
collation to the operator node.

I experimented with leaving out the name vs. name operators and just
adding cross-type text vs. name and name vs. text operators. That turns
out not to work well at all. Aside from the fact that opr_sanity whines
about an incomplete operator family, I found various situations where
the planner fails, complaining about things like "missing operator
1(19,19) in opfamily 1994". The root of that mess seems to be that
we've supposed that if an equality operator is marked mergejoinable,
then it is mergejoinable in every opfamily that it's a member of. But
that isn't true in an opfamily structure like this. For instance,
"text = name" should be mergejoinable in the name_ops opclass, since we
know how to sort both text and name in non-collation-aware ways. But
it's not mergejoinable in the text_ops opclass if text_ops doesn't
provide collation-aware name vs. name operators to sort the name input
with.

We could probably fix that, at the cost of about tripling the work
needed to detect whether an operator is really mergejoinable, but I
have little confidence that there aren't more problems lurking behind
it. There are a lot of aspects of EquivalenceClass processing that look
pretty questionable if we're trying to support operators that act this
way. For instance, if we derive "a = c" given "a = b" and "b = c", the
equality operator in "a = c" might be mergejoinable in a different set
of opclasses than the other two operators are, making it debatable
whether it can be thought to belong to the same EquivalenceClass at
all.

So the other approach I'm contemplating is to mark type name as
collatable (with "C" as its typcollation, probably). There are two
plausible sub-approaches:

1. The regular name comparison operators remain non-collation-aware.
This would be the least invasive way, but it'd have the odd side-effect
that expressions like "namecoll1 < namecoll2 COLLATE something" would
be accepted but the collation would be ignored. Also, we'd have to
invent some new names for the collation-aware name-vs-name operators,
and I don't see any obvious candidate for that.

2. Upgrade the name comparison operators to be collation-aware, with
(probably) all the same optimizations for C collation as we have for
text. This'd be a cleaner end result, but it seems like there are a lot
of potential side-effects; e.g., syscache lookups would have to be
prepared to pass the right collation argument to name comparisons.

I feel like #2 is probably really the Right Thing, but it's also
sounding like significantly more work than I thought this was going to
involve. Not sure if it's worth the effort right now.

Also, I think that either solution would lead to some subtle changes in
semantics. For example, right now if you compare a name column to a
text value, you get a text (collation-aware) comparison using the
database's default collation. It looks like if name columns are marked
with attcollation = 'C', that would win and the comparison would now
have 'C' collation unless you explicitly override it with a COLLATE
clause. I'm not sure this is a bad thing --- it'd be more likely to
match the sort order of the index on the column --- but it could
surprise people.

Thoughts?

regards, tom lane

[1] https://www.postgresql.org/message-id/5978.1544030...@sss.pgh.pa.us
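
To make the possible surprise concrete, a sketch of what option #2 would
mean in practice (assuming name columns get attcollation = 'C'; the
"en_US" collation here is just an example):

    -- The column's 'C' collation would win, so this would sort in
    -- C-locale order and could match the index on relname:
    SELECT relname FROM pg_class ORDER BY relname;

    -- Recovering the old database-default behavior would need an
    -- explicit override:
    SELECT relname FROM pg_class ORDER BY relname COLLATE "en_US";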
Re: pg_partition_tree crashes for a non-defined relation
Stephen Frost writes:
> * Michael Paquier (mich...@paquier.xyz) wrote:
>> On Sat, Dec 08, 2018 at 08:46:08AM -0500, Stephen Frost wrote:
>>> I wonder if we maybe should have a regression test for every such
>>> function which just queries the catalog in a way to force the function
>>> to be called for every relation defined in the regression tests, to
>>> ensure that it doesn't segfault or throw an error.

> No, I mean something like:
> with x as (select pg_partition_tree(relname) from pg_class)
> select 1 from x limit 1;
> or whatever it takes to make sure that the function is run against every
> entry in pg_class (or whatever is appropriate) while not returning the
> results (since we don't actually care about the output, we just want to
> make sure it doesn't ERROR or crash).

I'm going to object to that on cost grounds. It is not reasonable to
run moderately-expensive functions like this on more than a thousand
pg_class entries in order to test what are just a few distinct cases,
especially in code that's highly unlikely to break once written. Or at
least, it's not reasonable to do that every time anybody runs the
regression tests for the rest of our lives.

Moreover, this would only help check a specific new function if the
author or reviewer remembered to add a test case that does it. Since
the whole problem here is "I forgot to consider this", I don't have
much faith in that happening.

Maybe we should have some sort of checklist of things to think about
when adding new SQL-visible functions, rather than trying to
memorialize every such consideration as a regression test no matter how
expensive. Admittedly, "I forgot to go over the checklist" is still a
problem; but at least it's not penalizing every developer and every
buildfarm run till kingdom come.

regards, tom lane
Re: Collatability of type "name"
Hi

On Sun, 9 Dec 2018 at 18:50, Tom Lane wrote:
> I've been experimenting with the task proposed in [1] of expanding
> the text_ops operator family to include type "name" as well as
> cross-type text vs. name operators. These operators would need to
> offer collation-aware sorting, since that's exactly the difference
> between text_ops and the non-collation-aware name_ops opfamily.
>
> [...]
>
> Also, I think that either solution would lead to some subtle changes
> in semantics. For example, right now if you compare a name column
> to a text value, you get a text (collation-aware) comparison using
> the database's default collation. It looks like if name columns
> are marked with attcollation = 'C', that would win and the comparison
> would now have 'C' collation unless you explicitly override it with
> a COLLATE clause. I'm not sure this is a bad thing --- it'd be more
> likely to match the sort order of the index on the column --- but it
> could surprise people.

Sorting by table names is not a very common operation. I don't see a
'C' collation for names as any risk.

Regards

Pavel

> Thoughts?
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/5978.1544030...@sss.pgh.pa.us
Re: automatically assigning catalog toast oids
John Naylor writes:
> Commit 96cdeae07 added toast tables to most catalogs. One disadvantage
> is that the toast declarations require hard-coded oids, even though
> only shared catalogs actually need stable oids. Now that we can assign
> oids on the fly, it makes sense to do so for toast tables as well, as
> in the attached.

I'm a bit dubious that this is a good idea. It's handy, at least for
forensic situations, that the system catalogs have stable OIDs.

regards, tom lane
Re: pg_partition_tree crashes for a non-defined relation
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost writes:
> > * Michael Paquier (mich...@paquier.xyz) wrote:
> >> On Sat, Dec 08, 2018 at 08:46:08AM -0500, Stephen Frost wrote:
> >>> I wonder if we maybe should have a regression test for every such
> >>> function which just queries the catalog in a way to force the function
> >>> to be called for every relation defined in the regression tests, to
> >>> ensure that it doesn't segfault or throw an error.
>
> > No, I mean something like:
> > with x as (select pg_partition_tree(relname) from pg_class)
> > select 1 from x limit 1;
> > or whatever it takes to make sure that the function is run against every
> > entry in pg_class (or whatever is appropriate) while not returning the
> > results (since we don't actually care about the output, we just want to
> > make sure it doesn't ERROR or crash).
>
> I'm going to object to that on cost grounds. It is not reasonable to
> run moderately-expensive functions like this on more than a thousand
> pg_class entries in order to test what are just a few distinct cases,
> especially in code that's highly unlikely to break once written.

Yeah, I had been wondering if it would be too expensive also. I don't
entirely buy the argument that it's code that's 'highly unlikely to
break once written', though --- we do add new relkinds from time to
time, for example. Perhaps we could have these functions run just once
per relkind.

> Moreover, this would only help check a specific new function if the
> author or reviewer remembered to add a test case that does it.

We could possibly write a test which would run every function that
accepts the typical data types (oid/regclass/name/etc) instead of
depending on the author to remember to add it.

> Since the whole problem here is "I forgot to consider this", I don't
> have much faith in that happening.

Yeah, I agree that we'd want something more automated as, otherwise, it
would definitely be likely to be forgotten.

> Maybe we should have some sort of checklist of things to think about
> when adding new SQL-visible functions, rather than trying to
> memorialize every such consideration as a regression test no matter
> how expensive. Admittedly, "I forgot to go over the checklist" is
> still a problem; but at least it's not penalizing every developer and
> every buildfarm run till kingdom come.

This seems like something we should do regardless. Moreover, I'd
suggest that we start documenting some of these policies in the way
that we have a style guide for errors and such --- we need a "system
function style guide" that could start with something like "prefer to
return NULL instead of ERROR on unexpected but otherwise valid inputs,
and test that the code doesn't segfault when given a variety of
inputs".

Thanks!

Stephen
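
As a sketch of how such an automated test could enumerate its candidates
(restricting to single-argument regclass functions here is just an
example of the idea):

    -- Find catalog functions taking a single regclass argument; a test
    -- harness could then call each of these against every pg_class row
    -- and check only that no ERROR or crash occurs.
    SELECT p.oid::regprocedure
    FROM pg_proc p
    WHERE p.pronargs = 1
      AND p.proargtypes[0] = 'regclass'::regtype;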
Re: [HACKERS] REINDEX CONCURRENTLY 2.0
Hello

I reviewed the code and documentation, and I have a few notes. Did you
register this patch in the CF app?

I found one error in phase 4. Simple reproducer:

> create table test (i int);
> create index this_is_very_large_exactly_maxnamelen_index_name_wink_wink_wink
> on test (i);
> create index this_is_very_large_exactly_maxnamelen_index_name_wink_winkccold
> on test (i);
> reindex table CONCURRENTLY test;

This fails with the error:

> ERROR: duplicate key value violates unique constraint
> "pg_class_relname_nsp_index"
> DETAIL: Key (relname,
> relnamespace)=(this_is_very_large_exactly_maxnamelen_index_name_wink_win_ccold,
> 2200) already exists.

A CommandCounterIncrement() in (or after) index_concurrently_swap will
fix this issue.

> ReindexPartitionedIndex(Relation parentIdx)
> ereport(ERROR,
>         (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>          errmsg("REINDEX is not yet implemented for partitioned
> indexes")));

I think we need to add errhint("you can REINDEX each partition
separately") or something similar. Also, can we omit this warning for
REINDEX DATABASE? All partitions must be in the same database, and the
warning is useless in that case: we get the warning, but reindexing
each partition means the partitioned table is reindexed correctly
anyway.

Another behavioral issue I found with REINDEX (VERBOSE)
SCHEMA/DATABASE: the INFO ereport is printed twice for each table:

> INFO: relation "measurement_y2006m02" was reindexed
> DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.07 s.
> INFO: table "public.measurement_y2006m02" was reindexed

One comes from ReindexMultipleTables and the other (with
pg_rusage_show) from ReindexRelationConcurrently.

> ReindexRelationConcurrently
> if (!indexRelation->rd_index->indisvalid)

Would it be better to use the IndexIsValid macro here? And the same
question about the added indexform->indisvalid in
src/backend/commands/tablecmds.c.

>      An index build with the CONCURRENTLY option failed, leaving
>      an invalid index. Such indexes are useless but it can be
> -    convenient to use REINDEX to rebuild them. Note that
> -    REINDEX will not perform a concurrent build. To build the
> -    index without interfering with production you should drop the index and
> -    reissue the CREATE INDEX CONCURRENTLY command.
> +    convenient to use REINDEX to rebuild them.

This documentation change seems wrong to me: REINDEX CONCURRENTLY does
not rebuild invalid indexes. To fix invalid indexes we still need to
REINDEX with a table lock, or recreate the index concurrently.

> + A first pass to build the index is done for each new index entry.
> + Once the index is built, its flag pg_class.isready is
> + switched to true
> + At this point pg_class.indisvalid is switched to
> + true for the new index and to false for the old, and
> + Old indexes have pg_class.isready switched to false

These should be pg_index.indisvalid and pg_index.indisready, right?

regards, Sergei
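
In code form, the suggested fix for the phase 4 failure would amount to
something like the following (variable names are illustrative;
index_concurrently_swap is the function named in the review above):

    /* Swap the old and new indexes' pg_class/pg_index entries. */
    index_concurrently_swap(newIndexOid, oldIndexOid, oldName);

    /*
     * Make the catalog updates from the swap visible to the rest of
     * this command, so that a later rename back to the original index
     * name does not collide with the not-yet-visible swapped entry,
     * as in the duplicate-key failure reported above.
     */
    CommandCounterIncrement();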
Re: pg_partition_tree crashes for a non-defined relation
Stephen Frost writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> ... especially in code that's highly unlikely to break once written.

> I don't entirely buy the argument that it's code that's 'highly
> unlikely to break once written', though --- we do add new relkinds from
> time to time, for example. Perhaps we could have these functions run
> just once per relkind.

Well, the relevant code is likely to be "if relkind is not x, y, or z,
then PG_RETURN_NULL". If we add a new relkind and forget to consider
the function, the outcome is a NULL result that perhaps should not have
been NULL ... but a test like this won't help us notice that.

regards, tom lane
[PATCH] Minor cleanups of BRIN code
Hi list,

I was messing around a bit with BRIN code and found some cleanups to be
made:

1. Remove dead code in brin_form_tuple
2. Add missing Datum conversion macros
3. Use AttrNumber type in places using 1-based attribute numbers

Though it's not hard to find cases all over the source code similar to
(2) and (3), are these kinds of cleanups considered useful?

Regards,
Marti Raudsepp

From 002067288d7d8db0c2107e6266f95c53a1614b1d Mon Sep 17 00:00:00 2001
From: Marti Raudsepp
Date: Wed, 25 Oct 2017 22:30:14 +0300
Subject: [PATCH] brin: Minor cleanups of BRIN code

* Remove dead code in brin_form_tuple
* Add missing Datum conversion macros
* Use AttrNumber type in places using 1-based attribute numbers
---
 src/backend/access/brin/README           |  2 +-
 src/backend/access/brin/brin.c           |  4 ++--
 src/backend/access/brin/brin_inclusion.c | 11 ++-
 src/backend/access/brin/brin_minmax.c    | 16 +---
 src/backend/access/brin/brin_tuple.c     |  1 -
 5 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/brin/README b/src/backend/access/brin/README
index 636d96545b..3e80bad62f 100644
--- a/src/backend/access/brin/README
+++ b/src/backend/access/brin/README
@@ -21,7 +21,7 @@ possible to support the amgettuple interface. Instead, we only provide
 amgetbitmap support. The amgetbitmap routine returns a lossy TIDBitmap
 comprising all pages in those page ranges that match the query
 qualifications. The recheck step in the BitmapHeapScan node prunes tuples
-that are not visible according to the query qualifications.
+that are not visible according to the query predicates.
 
 An operator class must have the following entries:
 
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbcea7..0e8f4e0d08 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -252,7 +252,7 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
 							PointerGetDatum(bdesc),
 							PointerGetDatum(bval),
 							values[keyno],
-							nulls[keyno]);
+							BoolGetDatum(nulls[keyno]));
 			/* if that returned true, we need to insert the updated tuple */
 			need_insert |= DatumGetBool(result);
 		}
@@ -647,7 +647,7 @@ brinbuildCallback(Relation index,
 							attr->attcollation,
 							PointerGetDatum(state->bs_bdesc),
 							PointerGetDatum(col),
-							values[i], isnull[i]);
+							values[i], BoolGetDatum(isnull[i]));
 	}
 }
 
diff --git a/src/backend/access/brin/brin_inclusion.c b/src/backend/access/brin/brin_inclusion.c
index 6ce355c6a9..4918138014 100644
--- a/src/backend/access/brin/brin_inclusion.c
+++ b/src/backend/access/brin/brin_inclusion.c
@@ -81,10 +81,11 @@ typedef struct InclusionOpaque
 	FmgrInfo	strategy_procinfos[RTMaxStrategyNumber];
 } InclusionOpaque;
 
-static FmgrInfo *inclusion_get_procinfo(BrinDesc *bdesc, uint16 attno,
+static FmgrInfo *inclusion_get_procinfo(BrinDesc *bdesc, AttrNumber attno,
 					uint16 procnum);
-static FmgrInfo *inclusion_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno,
-				Oid subtype, uint16 strategynum);
+static FmgrInfo *inclusion_get_strategy_procinfo(BrinDesc *bdesc,
+				AttrNumber attno, Oid subtype,
+				uint16 strategynum);
 
 
 /*
@@ -588,7 +589,7 @@ brin_inclusion_union(PG_FUNCTION_ARGS)
  * or null if it is not exists.
  */
 static FmgrInfo *
-inclusion_get_procinfo(BrinDesc *bdesc, uint16 attno, uint16 procnum)
+inclusion_get_procinfo(BrinDesc *bdesc, AttrNumber attno, uint16 procnum)
 {
 	InclusionOpaque *opaque;
 	uint16		basenum = procnum - PROCNUM_BASE;
@@ -646,7 +647,7 @@ inclusion_get_procinfo(BrinDesc *bdesc, uint16 attno, uint16 procnum)
  * made here, see that function too.
  */
 static FmgrInfo *
-inclusion_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno, Oid subtype,
+inclusion_get_strategy_procinfo(BrinDesc *bdesc, AttrNumber attno, Oid subtype,
 				uint16 strategynum)
 {
 	InclusionOpaque *opaque;
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index 0f6aa33a45..5f28cf362c 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -29,7 +29,7 @@ typedef struct MinmaxOpaque
 	FmgrInfo	strategy_procinfos[BTMaxStrategyNumber];
 } MinmaxOpaque;
 
-static FmgrInfo *minmax_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno,
+static FmgrInfo *minmax_get_strategy_procinfo(BrinDesc *bdesc, AttrNumber attno,
 				  Oid subtype, uint16 strategynum);
 
 
@@ -68,7 +68,7 @@ brin_minmax_add_value(PG_FUNCTION_ARGS)
 	BrinDesc   *bdesc = (BrinDesc *) PG_GETARG_POINTER(0);
 	BrinValues *column = (BrinValues *) PG_GETARG_POINTER(1);
 	Datum		newval = PG_GETARG_DATUM(2);
-	bool		isnull = PG_GETARG_DATUM(3);
+	bool		isnull = PG_GETARG_BOOL(3);
 	Oid			colloid = PG_GET_COLLATION();
 	FmgrInfo   *cmpFn;
 	Datum		compar;
@@ -281,8 +281,9 @@ brin_minmax_union(PG_FUNCTION_ARGS)
 	/* Adjust minimum, if B's min is less than A
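
For context on item (2), a standalone illustration of the conversion
macros involved (a sketch, not part of the patch; my_isnull is a
hypothetical variable):

    /* fmgr passes everything as Datum, so a bool must be wrapped and
     * unwrapped explicitly rather than relying on implicit casts. */
    Datum d = BoolGetDatum(my_isnull);   /* bool -> Datum */
    bool  b = DatumGetBool(d);           /* Datum -> bool */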
Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock
On Sat, Dec 8, 2018 at 12:48 PM Andrey Borodin wrote:
> > On Dec 8, 2018, at 6:54, Alexander Korotkov wrote:
> >
> > Yep, please find attached draft patch.
>
> Patch seems good to me, I'll check it in more detail.
> The patch gets the posting item at FirstOffsetNumber instead of
> btree->getLeftMostChild(). This seems OK, since dataGetLeftMostPage()
> is doing just the same, but with a few Assert()s.

I'd like to avoid creating a GinBtree just for calling
getLeftMostChild(). Also, a few more places in ginvacuum.c do the same.
We have the same amount of Assert()s in ginVacuumPostingTreeLeaves(),
so let's keep it uniform. I would also like Peter Geoghegan to take a
look at this patch before committing it.

> > BTW, it seems that I've another bug in GIN. README says that
> >
> > "However, posting trees are only
> > fully searched from left to right, starting from the leftmost leaf. (The
> > tree-structure is only needed by insertions, to quickly find the correct
> > insert location). So as long as we don't delete the leftmost page on each
> > level, a search can never follow a downlink to page that's about to be
> > deleted."
> >
> > But that's not really true once we teach GIN to skip parts of posting
> > trees in PostgreSQL 9.4. So, there might be a risk to visit a page to
> > be deleted using a downlink. But in order to get a real problem, vacuum
> > should pass the cleanup stage and someone else should reuse this page,
> > before we traverse the downlink. Thus, the risk of a real problem is
> > very low. But it still should be addressed.
>
> There's a patch above in this thread
> 0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch where I
> propose stamping every deleted page with GinPageSetDeleteXid(page,
> ReadNewTransactionId()); and avoiding reuse of the page before
> TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalDataXmin).
> Should we leave this bug alone for future fixes, to keep the current
> fix noninvasive?

I think this is a separate bug introduced in PostgreSQL 9.4, which
should be backpatched with a separate commit. Could you please extract
a patch dealing with GinPageSetDeleteXid() and GinPageGetDeleteXid()?
The rest of the work made in your patch should be considered for
master.

BTW, what do you think about the locking order in ginRedoDeletePage()?
Do you have any explanation of the current behavior? I'll try to reach
Teodor tomorrow with this question.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
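
Putting the two pieces of that proposal side by side (a sketch using
the names quoted above, not the final patch):

    /* At page deletion: stamp the deleted page with the next XID. */
    GinPageSetDeleteXid(page, ReadNewTransactionId());

    /*
     * At reuse: only recycle the page once no scan that could still
     * hold a downlink to it can be running.
     */
    if (TransactionIdPrecedes(GinPageGetDeleteXid(page),
                              RecentGlobalDataXmin))
    {
        /* safe to reuse the page */
    }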
Why does TupleDescInitBuiltinEntry lack a "default: error" case?
If the passed-in type OID isn't one of the very short list that
TupleDescInitBuiltinEntry supports, it will silently hand back a broken
TupleDesc, rather than throwing an error. How can this possibly be
considered good code?

(So far as I can see, none of the extant callers could hit such an
error, but it's still bad code.)

regards, tom lane
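
The kind of guard being asked for is simply (a sketch; TEXTOID stands
in for the short supported list):

    switch (oidtypeid)
    {
        case TEXTOID:
            /* ... initialize the attribute as today ... */
            break;

        /* ... the other supported built-in types ... */

        default:
            elog(ERROR, "unsupported type %u in TupleDescInitBuiltinEntry",
                 oidtypeid);
    }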
Re: Why does TupleDescInitBuiltinEntry lack a "default: error" case?
Hi,

On 2018-12-09 14:49:35 -0500, Tom Lane wrote:
> If the passed-in type OID isn't one of the very short list that
> TupleDescInitBuiltinEntry supports, it will silently hand back
> a broken TupleDesc, rather than throwing an error. How can
> this possibly be considered good code?

+1

Greetings,

Andres Freund
Re: automatically assigning catalog toast oids
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> John Naylor writes:
> > Commit 96cdeae07 added toast tables to most catalogs. One disadvantage
> > is that the toast declarations require hard-coded oids, even though
> > only shared catalogs actually need stable oids. Now that we can assign
> > oids on the fly, it makes sense to do so for toast tables as well, as
> > in the attached.
>
> I'm a bit dubious that this is a good idea. It's handy, at least for
> forensic situations, that the system catalogs have stable OIDs.

I tend to agree... What's the advantage of assigning them on the fly?

Thanks!

Stephen
Re: automatically assigning catalog toast oids
Hi,

On 2018-12-09 15:42:57 -0500, Stephen Frost wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > John Naylor writes:
> > > Commit 96cdeae07 added toast tables to most catalogs. One disadvantage
> > > is that the toast declarations require hard-coded oids, even though
> > > only shared catalogs actually need stable oids. Now that we can assign
> > > oids on the fly, it makes sense to do so for toast tables as well, as
> > > in the attached.
> >
> > I'm a bit dubious that this is a good idea. It's handy, at least for
> > forensic situations, that the system catalogs have stable OIDs.

Hm, but won't they have that for major versions anyway? We ought not to
change the .bki generation in a way that results in differing oids
after a release, no?

> I tend to agree... What's the advantage of assigning them on the fly?

No idea if that's John's reasoning, but I do like not having to do yet
another manual step that you need to remember/figure out when adding a
new catalog relation.

Greetings,

Andres Freund
Re: automatically assigning catalog toast oids
Andres Freund writes:
> On 2018-12-09 15:42:57 -0500, Stephen Frost wrote:
>> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>>> I'm a bit dubious that this is a good idea. It's handy, at least for
>>> forensic situations, that the system catalogs have stable OIDs.

> Hm, but won't they have that for major versions anyway? We ought not to
> change the .bki generation in a way that results in differing oids
> after a release, no?

Well, that's just a different very-easily-broken assumption. There are
a lot of things that make auto-assigned OIDs unstable, and I do not
think that we want to guarantee that they'll hold still across a
release series.

BTW, now that I'm looking at it, I very much dislike the way commit
578b2297 handled auto-assignment of OIDs in catalogs. Not only do I not
want to assign more OIDs that way, I don't think we can safely ship v12
like this. There's basically nothing at all that guarantees
genbki-assigned OIDs won't overlap initdb-assigned OIDs. In fact, I
think it's about guaranteed to blow up in the face of anybody who
inserts manually-assigned OIDs above around 9000.

What we probably need to do is restructure the FirstBootstrapObjectId
business so that there are separate, well-defined ranges for genbki.pl
and initdb to use.

regards, tom lane
Re: automatically assigning catalog toast oids
Hi,

On 2018-12-09 17:14:42 -0500, Tom Lane wrote:
> Andres Freund writes:
> > On 2018-12-09 15:42:57 -0500, Stephen Frost wrote:
> > > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > > I'm a bit dubious that this is a good idea. It's handy, at least for
> > > > forensic situations, that the system catalogs have stable OIDs.
> >
> > Hm, but won't they have that for major versions anyway? We ought not to
> > change the .bki generation in a way that results in differing oids
> > after a release, no?
>
> Well, that's just a different very-easily-broken assumption. There are
> a lot of things that make auto-assigned OIDs unstable, and I do not
> think that we want to guarantee that they'll hold still across a
> release series.

Why wouldn't they be, for genbki (rather than initdb) assigned oids? I
don't think it's reasonable to add new functions or such post-release
that would move oid assignments for other objects.

> BTW, now that I'm looking at it, I very much dislike the way commit
> 578b2297 handled auto-assignment of OIDs in catalogs. Not only do I not
> want to assign more OIDs that way

Why do you not want that?

> , I don't think we can safely ship v12 like this. There's basically
> nothing at all that guarantees genbki-assigned OIDs won't overlap
> initdb-assigned OIDs. In fact, I think it's about guaranteed to blow
> up in the face of anybody who inserts manually-assigned OIDs above
> around 9000.

I'm not sure I really see a pressing problem with that. I think it'd
not be insane to treat this as a "well, don't do that then".

> What we probably need to do is restructure the FirstBootstrapObjectId
> business so that there are separate, well-defined ranges for genbki.pl
> and initdb to use.

I'm fine with adding a distinct range; the earlier version of the patch
had that. I'd asked for comments on whether anybody felt a need to keep
that, and nobody replied... I alternatively proposed that we could just
start at FirstNormalObjectId for those and update the server's oid
start value to the maximum genbki-assigned oid. Do you have preferences
around that?

Greetings,

Andres Freund
Re: pg_partition_tree crashes for a non-defined relation
On Sun, Dec 09, 2018 at 02:07:29PM -0500, Tom Lane wrote:
> Stephen Frost writes:
>> I don't entirely buy the argument that it's code that's 'highly
>> unlikely to break once written', though --- we do add new relkinds
>> from time to time, for example. Perhaps we could have these functions
>> run just once per relkind.
>
> Well, the relevant code is likely to be "if relkind is not x, y, or z,
> then PG_RETURN_NULL". If we add a new relkind and forget to consider
> the function, the outcome is a NULL result that perhaps should not
> have been NULL ... but a test like this won't help us notice that.

Yes, in order to prevent problems with newly-introduced relkinds, I
think that the checks within functions should be careful to check only
for relkinds that they support, and not list those they do not support.

-- 
Michael
Re: pg_partition_tree crashes for a non-defined relation
On Sun, Dec 09, 2018 at 12:49:18PM -0500, Stephen Frost wrote:
> Looks alright on a quick glance, but shouldn't you also update the
> comment?

The comment, on HEAD or with the patch, is:

/* Only allow relation types that can appear in partition trees. */

This still looks adapted to me. Or would you reword it because "allow"
rather implies that an ERROR is returned? Would you prefer changing it
to something like:

"Return NULL for relation types that do not appear in partition trees."

-- 
Michael
Re: automatically assigning catalog toast oids
Andres Freund writes:
> On 2018-12-09 17:14:42 -0500, Tom Lane wrote:
>> Well, that's just a different very-easily-broken assumption. There are
>> a lot of things that make auto-assigned OIDs unstable, and I do not
>> think that we want to guarantee that they'll hold still across a
>> release series.

> Why wouldn't they be, for genbki (rather than initdb) assigned oids? I
> don't think it's reasonable to add new functions or such post-release
> that would move oid assignments for other objects.

As you've got this set up, we couldn't change *anything* for fear of it
moving auto-assignments; there's no isolation between catalogs.

Another thing I seriously dislike is that this allows people to omit
OIDs from .dat entries in catalogs where we traditionally hand-assign
OIDs. That's not a good idea, because it would mean those entries don't
have stable OIDs, whereas the whole point of hand assignment is to
ensure all built-in objects of a particular type have stable OIDs. Now,
you could argue about the usefulness of that policy for any given
catalog; but if we decide that catalog X doesn't need stable OIDs, then
that should be an intentional policy change, not something that can
happen because one lazy hacker didn't follow the policy. (This is,
btw, another reason why I don't much like allowing toast tables to not
have stable OIDs; it confuses what the policy is for pg_class.)

>> In fact, I think it's about guaranteed to blow
>> up in the face of anybody who inserts manually-assigned OIDs above
>> around 9000.

> I'm not sure I really see a pressing problem with that. I think it'd
> not be insane to treat this as a "well, don't do that then".

I can name more than one company that will be pretty damn unhappy when
we break their local patches that make use of 9K-range OIDs. For better
or worse, the policy for the last 20 years has been that OIDs up to
9999 are available for hand assignment. What you've just done is to
break that for OIDs above some boundary that's not even very well
defined (though it seems to be somewhere around 8500 as of HEAD).

>> What we probably need to do is restructure the FirstBootstrapObjectId
>> business so that there are separate, well-defined ranges for genbki.pl
>> and initdb to use.

> I'm fine with adding a distinct range; the earlier version of the patch
> had that. I'd asked for comments on whether anybody felt a need to keep
> that, and nobody replied... I alternatively proposed that we could just
> start at FirstNormalObjectId for those and update the server's oid
> start value to the maximum genbki-assigned oid. Do you have preferences
> around that?

Yeah, I thought about the latter as well. But it adds complexity to the
bootstrap process and makes it harder to tell what assigned a
particular OID, so I'd rather go with the former, at least until the
OID situation gets too tight to allow for daylight between the ranges.

It looks to me like as of HEAD, genbki.pl is auto-assigning about 1470
OIDs. Meanwhile, on my RHEL6 machine, initdb is auto-assigning about
1740 OIDs (what a coincidence); of those, 872 are collation entries
that are absorbed from the system environment. So the second number is
likely to vary a lot from platform to platform. (I don't have ICU
enabled; I wonder how many that typically adds.) I'd be inclined to
allow say 2000 OIDs for genbki.pl, with 4384 therefore available for
initdb. We could expect to have to raise the boundary from time to
time, but not very often.

regards, tom lane
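
In terms of constants, that proposal would carve up the OID space
roughly like this (the boundary values follow the 2000/4384 arithmetic
above; the symbol names are illustrative, not committed code):

    #define FirstGenbkiObjectId     10000   /* genbki.pl assigns from here */
    #define FirstBootstrapObjectId  12000   /* initdb assigns from here */
    #define FirstNormalObjectId     16384   /* user objects start here */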
RE: Too many logs are written on Windows (LOG: could not reserve shared memory region (addr=%p) for child %p:)
From: Magnus Hagander [mailto:mag...@hagander.net]
> In this particular case, it seems it *was* helpful, no? That's how you
> found out the customer used a broken antivirus product, which may
> certainly also cause *other* issues.
>
> Some sort of rate limiting to reduce the volume might help, but the
> message itself seems to have clearly been useful.

+1

We can change pgwin32_ReserveSharedMemoryRegion() to take an argument
"int loglevel". Then the caller first calls it with LOG, and DEBUGx
afterwards. It may also be helpful for the caller to output "LOG: tried
%d times to reserve shared memory region" when the caller ran the
function twice or more before success. That explains the possibly long
time or CPU spikes of connection establishment.

Regards
Takayuki Tsunakawa
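
In rough code terms, the suggestion would look something like this (a
sketch only; the signature change is the proposal above, not existing
code, and the retry loop and the "tried %d times" message would live in
the caller):

    /* Sketch: let the caller pick the elevel for each attempt. */
    static bool
    pgwin32_ReserveSharedMemoryRegion(HANDLE hChild, int loglevel)
    {
        void   *address;

        address = VirtualAllocEx(hChild, UsedShmemSegAddr, UsedShmemSegSize,
                                 MEM_RESERVE, PAGE_READWRITE);
        if (address == NULL)
        {
            /* LOG on the first attempt, DEBUGx on retries */
            elog(loglevel,
                 "could not reserve shared memory region (addr=%p) for child %p: error code %lu",
                 UsedShmemSegAddr, hChild, GetLastError());
            return false;
        }
        return true;
    }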
RE: Statement-level rollback
From: Andres Freund [mailto:and...@anarazel.de]
> Isn't the realistic scenario for migrations that people will turn this
> feature on on a more global basis? If they need to make per-transaction
> choices, that makes this much less useful for easy migrations.

Agreed. My approach of per-transaction choice may be overkill.
Actually, I didn't think per-transaction change was really necessary in
practice. But I didn't think of any reason to prohibit per-transaction
change, so I just wanted to allow flexibility.

I think per database/user change (ALTER DATABASE/USER) can be useful in
cases where applications are migrated from other DBMSs to a PostgreSQL
instance; that is, database consolidation for easier data analytics and
database management. One database/schema holds data for a PostgreSQL
application, and another one stores data for a migrated application.

Or, should we recommend a separate PostgreSQL instance on a different
VM or container, and just introduce the parameter only in
postgresql.conf?

Regards
Takayuki Tsunakawa
RE: Too many logs are written on Windows (LOG: could not reserve shared memory region (addr=%p) for child %p:)
Hi Noah, Magnus and Tsunakawa-san,

Thank you for replying.

> Can you adopt pgbouncer, to reduce
> the frequency of starting new backend processes? That should improve
> your performance, too.

Actually, before I found that F-Secure causes this message, I
recommended that my customer use connection pooling to reduce the
number of connection attempts.

> Could you collect the information
> http://postgr.es/m/20181203053506.gb2860...@rfd.leadboat.com requests?
> That may help us understand your system's unusual behavior. (The issue
> in that thread is related but not identical.)

Sorry, since my customer uses PostgreSQL in a production environment, I
cannot deploy debug modules.

> In this particular case, it seems it *was* helpful, no? That's how you
> found out the customer used a broken antivirus product, which may
> certainly also cause *other* issues.
>
> Some sort of rate limiting to reduce the volume might help, but the
> message itself seems to have clearly been useful.

Yes, you are right. The message itself was useful. Your idea to reduce
the volume seems good.

> We can change pgwin32_ReserveSharedMemoryRegion() to take an argument
> "int loglevel". Then the caller first calls it with LOG, and DEBUGx
> afterwards. It may also be helpful for the caller to output "LOG: tried
> %d times to reserve shared memory region" when the caller ran the
> function twice or more before success. That explains the possibly long
> time or CPU spikes of connection establishment.

It seems a good idea to me.

Regards,
Ryohei Takahashi
Re: Should new partitions inherit their tablespace from their parent?
On Mon, Dec 10, 2018 at 02:05:29AM +1300, David Rowley wrote:
> On Fri, 7 Dec 2018 at 20:15, Michael Paquier wrote:
>> -ATExecPartedIdxSetTableSpace(Relation rel, Oid newTableSpace)
>> +ATExecSetTableSpaceNoStorage(Relation rel, Oid newTableSpace)
>> NoStorage looks strange as a routine name for this case. Would something
>> like ATExecPartedRelSetTableSpace be more adapted perhaps?
>
> I'd say it's better to name the function in the more general-purpose
> way. Perhaps we'll get some relkind in the future that's not a
> partitioned table or index but that needs reltablespace set. If that
> happens and we name the function the way you're proposing, then we
> might end up with some other function that does the same thing, or the
> patch adding the new relkind would have to rename the function to
> something more like what I have. To save either of those from having
> to happen, would it not be better to give it the generic name now?
>
> Instead of renaming the function, I've altered the Assert() to allow
> all relkinds which don't have storage, and also updated the comment to
> remove the mention of partitioned tables and indexes.

-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX);
+	Assert(rel->rd_rel->relkind == RELKIND_VIEW ||
+		   rel->rd_rel->relkind == RELKIND_COMPOSITE_TYPE ||
+		   rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
+		   rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE ||
+		   rel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX);

Alvaro has begun a thread about that recently, so it seems to me that
this assertion could become invalid (per my comments down-thread):
https://www.postgresql.org/message-id/20181206215552.fm2ypuxq6nhpwjuc@alvherre.pgsql

Of course, it depends on the conclusion of that other thread. It will
perhaps make sense at some point to review which relkinds this function
can work on, but I would recommend reviewing that behavior when it
actually makes sense, and keeping the current interface simple. So your
first version looked fine to me on those grounds. Another important
thing which would help with a restricted assertion is that if this code
path begins to be taken for other relkinds, then the developer who
implements a new feature depending on it will need to look at this code
and consider the use-case behind it, instead of assuming that it will
hopefully just work.

>> + else if (stmt->partbound)
>> + {
>> +     RangeVar   *parent;
>> +     Relation    parentrel;
>> +
>> +     /*
>> +      * For partitions, when no other tablespace is specified, we default
>> +      * the tablespace to the parent partitioned table's.
>> +      */
>> Okay, so the direct parent is what counts, and not the top-most parent.
>> Could you add a test with multiple levels of partitioned tables, like:
>> - parent is in tablespace 1.
>> - child is created in tablespace 2.
>> - grandchild is created, which should inherit tablespace 2 from the
>> child, but not tablespace 1 from the parent. In the existing example,
>> as one partition is used to test the top-most parent and another for the
>> new default, it looks cleaner to create a third partition which would be
>> itself a partitioned table.
>
> Sounds good. I've modified the existing test to do it that way. I
> don't think there's a need to have multiple tests for that.
>
> I've attached an updated patch.

Thanks for the new version. The new test case looks fine to me.

-- 
Michael
Re: Too many logs are written on Windows (LOG: could not reserve shared memory region (addr=%p) for child %p:)
On Mon, Dec 10, 2018 at 02:15:07AM +, Takahashi, Ryohei wrote:
>> We can change pgwin32_ReserveSharedMemoryRegion() to take an argument
>> "int loglevel". Then the caller first calls it with LOG, and DEBUGx
>> afterwards. It may also be helpful for the caller to output "LOG: tried
>> %d times to reserve shared memory region" when the caller ran the
>> function twice or more before success. That explains the possibly long
>> time or CPU spikes of connection establishment.
>
> It seems a good idea to me.

Like a couple of others on this thread, I doubt that lowering the
elevel here would be a good thing, as keeping it noisy has been what
allows us to know that something has gone wrong, no? The current
behavior is useful.

-- 
Michael
Re: min_parallel_table_size and inheritance
On Sun, Dec 9, 2018 at 6:24 AM Justin Pryzby wrote:
>
> The docs say:
> https://www.postgresql.org/docs/current/runtime-config-query.html
> |min_parallel_table_scan_size Sets the minimum amount of table data
> that must be scanned in order for a parallel scan to be considered.
> [...]
>
> I'd like to set min_parallel_table_scan_size=32MB, but it looks like
> that won't do what I intend for at least one of our tables using
> inheritance.
>
> It seems to me that an individual table should not be scanned in
> parallel if its size is below the threshold, even if it's a child and
> has siblings which are larger and scanned in parallel.
>
> I found that the current behavior seems to be more or less deliberate,

Yes, you are right.

> but maybe should be revisited following implementation of the
> "parallel append" node, as previously discussed.

Why? How does parallel append help for an individual table/partition
scan? If you want a parallel scan to be forced for any individual
table, you might want to try setting the 'parallel_workers' parameter
for that table. For example: 'Alter Table tbl Set (parallel_workers=2);'

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Use durable_unlink for .ready and .done files for WAL segment removal
On Thu, Dec 06, 2018 at 11:18:12PM +, Bossart, Nathan wrote:
> That seems reasonable to me.

Thanks, committed after renaming a couple of variables, adding a
comment about the use of unlink(), and adding a comment explaining what
the different retry counters are for.

-- 
Michael
Re: [PATCH] Log PostgreSQL version number on startup
On Wed, Nov 21, 2018 at 11:32:46AM -0500, Stephen Frost wrote:
> * Christoph Berg (christoph.b...@credativ.de) wrote:
>> it has bugged me for a long time that there's no clear "PostgreSQL is
>> starting" message in the server log file. I'd like to change that for
>> two reasons:
>
> +1

+1. One complaint which could be made is that this makes the logs at
startup more noisy. Now, your patch has an issue if you want to ensure
that this information gets added to the log files: at this stage of
startup the GUCs are not loaded, hence the message is sent to stderr,
and perhaps not to the log files.

-- 
Michael