Re: [PATCH] Add missing type conversion functions for PL/Python
+1. I also think that we should not change the previous behavior of PL/Python. @Nikita Glukhov, maybe we could check whether the PyObject is int or long only in the related conversion functions, and fall back otherwise?

On Fri, Jul 13, 2018 at 12:09 AM Heikki Linnakangas wrote:
> On 12/07/18 18:06, Nikita Glukhov wrote:
> > I have added some cross-type test cases and now almost all new code is covered
> > (excluding several error cases which can be triggered only by custom numeric
> > type implementations).
>
> Thanks! Some of those new tests actually fail, if you run them against
> unpatched master. For example:
>
>   SELECT * FROM test_type_conversion_float8_int2(100::float8);
>   INFO: (100.0, )
>  - test_type_conversion_float8_int2
>  ---
>  - 100
>  -(1 row)
>  -
>  +ERROR: invalid input syntax for integer: "100.0"
>  +CONTEXT: while creating return value
>  +PL/Python function "test_type_conversion_float8_int2"
>
> So this patch is making some subtle changes to behavior. I don't think
> we want that.
>
> - Heikki

--
Regards,
Haozhou
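For illustration, a minimal sketch of the "check the Python type in the integer conversion path and fall back otherwise" idea; this is not taken from the patch, and the helper name is hypothetical:

    #include <Python.h>
    #include <stdbool.h>

    /*
     * Hypothetical helper: return true only if the object is a Python
     * integer, so the caller can use a direct numeric conversion and
     * otherwise fall back to the pre-existing string-based path,
     * preserving the current behavior for all other Python types.
     */
    static bool
    pyobject_is_integer(PyObject *obj)
    {
        if (PyLong_Check(obj))
            return true;
    #if PY_MAJOR_VERSION < 3
        if (PyInt_Check(obj))          /* Python 2 "int" */
            return true;
    #endif
        return false;
    }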
Re: Make foo=null a warning by default.
Hello David,

> Per a discussion with Andrew Gierth and Vik Fearing, both of whom helped make
> this happen, please find attached a patch which makes it possible to get SQL
> standard behavior for "= NULL", which is an error. It's been upgraded to a
> warning, and can still be downgraded to silence (off) and MS-SQL-compatible
> behavior (on).

That's nice, and good for students, and probably others as well :-)

A few comments:

Patch applies & compiles, "make check" ok.

 #backslash_quote = safe_encoding   # on, off, or safe_encoding
 [...]
 #transform_null_equals = warn

I think it would be nice to have the possible values as a comment in
"postgresql.conf".

* Code:

 -bool operator_precedence_warning = false;
 +bool operator_precedence_warning = false;

Is this reindentation useful/needed?

 + parser_errposition(pstate, a->location)));
 + if (need_transform_null_equals && transform_null_equals == TRANSFORM_NULL_EQUALS_ON)

For consistency, maybe skip a line before the if?

 transform_null_equals_options[] = { [...]

I do not really understand the sort order of this array. Maybe it could be
alphabetical, or per value? Or maybe it is standard values and then synonyms,
in which case maybe a comment at the end of the list would help.

GUC help text: ISTM that it should spell out the possible values and their
behavior. Maybe something like:

"""
When turned on, turns expr = NULL into expr IS NULL. With off, warn or error,
do not do so and be silent, issue a warning or an error. The standard behavior
is not to perform this transformation. Default is warn.
"""

Or anything in better English.

* Test

 +select 1=null;
 + ?column?

Maybe use AS to describe the expected behavior, so that it is easier to check,
and because I think that "?column?" looks silly, e.g.:

 select 1=null as expect_{true,false}[_and_a_warning/error];

 create table cnullchild (check (f1 = 1 or f1 = null)) inherits(cnullparent);
 +WARNING: = NULL can only produce a NULL

I'm not sure of this test case. Should it be turned into "is null"? Maybe the
behavior could be retested after the reset?

* Documentation:

The "error" value is not documented. More generally, the value is said to be
an enum, but the list of values is not listed anywhere in the documentation.
ISTM that the doc would be clearer if the behavior were spelled out for each
of the four values of the parameter. Maybe "warn" in the doc should be marked
up as a literal, or something.

--
Fabien.
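For illustration, here is what the "standard values first, then synonyms, with a comment" suggestion could look like for the options array. The entry layout follows PostgreSQL's struct config_enum_entry (name, value, hidden); the exact set of entries and enum names beyond TRANSFORM_NULL_EQUALS_ON is an assumption, not something quoted from the patch:

    static const struct config_enum_entry transform_null_equals_options[] = {
        /* primary values */
        {"off", TRANSFORM_NULL_EQUALS_OFF, false},
        {"on", TRANSFORM_NULL_EQUALS_ON, false},
        {"warn", TRANSFORM_NULL_EQUALS_WARN, false},
        {"error", TRANSFORM_NULL_EQUALS_ERROR, false},
        /* hidden synonyms, kept for backward compatibility */
        {"true", TRANSFORM_NULL_EQUALS_ON, true},
        {"false", TRANSFORM_NULL_EQUALS_OFF, true},
        {NULL, 0, false}                /* end of list */
    };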
Re: Make foo=null a warning by default.
On 16/07/18 04:40, David Fetter wrote:
> Folks,
>
> Per a discussion with Andrew Gierth and Vik Fearing, both of whom helped make
> this happen, please find attached a patch which makes it possible to get SQL
> standard behavior for "= NULL", which is an error. It's been upgraded to a
> warning, and can still be downgraded to silence (off) and MS-SQL-compatible
> behavior (on).

I don't agree with changing the default to 'warn'. "foo = NULL" is perfectly
legal SQL, even if it's not very useful in practice.

The use case for transform_null_equals='warn' that I see is to wean off an
application that uses transform_null_equals='on', to find all the queries that
rely on it. But I'm not too excited about that either. The documented case for
using transform_null_equals='on' is compatibility with Microsoft Access, and
this wouldn't help with that. Besides, it's easy enough to grep the application
or the server log for " = NULL", to find all such queries.

- Heikki
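For readers following along, a quick SQL illustration of why "foo = NULL" is legal but rarely useful: under standard semantics the comparison yields NULL, so the WHERE clause never selects anything, whereas IS NULL does what people usually intend.

    CREATE TABLE t (foo int);
    INSERT INTO t VALUES (1), (NULL);

    SELECT * FROM t WHERE foo = NULL;    -- 0 rows: "= NULL" evaluates to NULL
    SELECT * FROM t WHERE foo IS NULL;   -- 1 row: the NULL row
    SELECT 1 = NULL AS result;           -- returns NULL, not false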
Parallel queries in single transaction
Hello,

I am working with a data warehouse based on PostgreSQL and would like to propose
a feature. The idea is to give the developer control and the ability to execute
queries in parallel within a single transaction.

The usual flow is: START_TRANSACTION -> QUERY1 -> QUERY2 -> QUERY3 -> END_TRANSACTION.
However, sometimes QUERY1 and QUERY2 are independent and could be executed in
parallel, e.g.:

START_TRANSACTION -> DEFINE_QUERY1 (no execution) -> DEFINE_QUERY2 (no execution)
  -> EXECUTE_QUERY1_AND_QUERY2 (in parallel) -> QUERY3 -> END

Of course QUERY1 and QUERY2 can be dependent, and then this would not work, but
sometimes it is useful, especially when you are bound by e.g. CPU and a query is
stuck. If we go further, the PostgreSQL engine could possibly find such cases by
itself and run the queries in parallel, but that sounds like going too far.

Here is also an example in Scala of how you can wait for several futures:
https://stackoverflow.com/a/16257851/2439539

Syntax:

BEGIN;

-- create two temp tables because PREPARE does not support CTAS

-- query1
CREATE TEMP TABLE QUERY1_RESULT (...) ON COMMIT DROP;
PREPARE query1 (id int, val varchar) AS INSERT INTO QUERY1_RESULT(...) SELECT ...;

-- query2
CREATE TEMP TABLE QUERY2_RESULT (...) ON COMMIT DROP;
PREPARE query2 (id int, val varchar) AS INSERT INTO QUERY2_RESULT(...) SELECT ...;

-- execute in parallel
execute parallel (query1, query2);

-- query3
...

END;

-Paul
Re: Make foo=null a warning by default.
Hello Heikki,

> The use case for transform_null_equals='warn' that I see is to wean off an
> application that uses transform_null_equals='on', to find all the queries
> that rely on it. But I'm not too excited about that either. The documented
> case for using transform_null_equals='on' is compatibility with Microsoft
> Access, and this wouldn't help with that. Besides, it's easy enough to grep
> the application or the server log for " = NULL", to find all such queries.

I'm in favor of having this feature, even if off by default, because I could
enable it and it would help my students when learning SQL in interactive
sessions, where "grep '= *NULL'" cannot be issued.

--
Fabien.
RE: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
I get it. Thank you for the clarification.

Regards,
Phil

From: David Rowley
Sent: Monday, 16 July 2018 07:48
To: Phil Florent
Cc: Tom Lane; Robert Haas; Amit Langote; PostgreSQL Hackers
Subject: Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

On 16 July 2018 at 16:56, Phil Florent <philflor...@hotmail.com> wrote:
> I should post that in the general section, but I am confused by the sentence
> "A parent partition is always going to have a lower relid than its children".

It's a little confusing since RelOptInfo has a relid field and so does
RangeTblEntry. They both have completely different meanings.

RelOptInfo's relid is a number starting at 1 that continues in a gapless
sequence, increasing by 1 with each RelOptInfo. These relids are completely
internal to the server and don't appear in the system catalog tables.
RangeTblEntry's relid is what's in pg_class.oid. I was talking about
RelOptInfo's relid.

Using relids starting at 1 is quite convenient for allowing direct array
lookups in various data structures in the planner. However, it's also required
to uniquely identify a relation, as a single table may appear many times in a
query, so trying to identify relations by their OID could be ambiguous. Also,
some RTEKinds don't have storage, e.g. a VALUES() clause.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
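For reference, a self-contained sketch of the two fields being contrasted; the real struct definitions in the PostgreSQL headers have many more fields, and the comments here only paraphrase the explanation above:

    typedef unsigned int Index;     /* as in the PostgreSQL headers */
    typedef unsigned int Oid;       /* as in the PostgreSQL headers */

    /* abridged sketch: planner's per-relation working data */
    typedef struct RelOptInfo
    {
        Index   relid;      /* 1-based, gapless, planner-internal */
        /* ... many other fields elided ... */
    } RelOptInfo;

    /* abridged sketch: parse-tree range table entry */
    typedef struct RangeTblEntry
    {
        Oid     relid;      /* pg_class.oid of the referenced relation, if any */
        /* ... many other fields elided ... */
    } RangeTblEntry;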
Re: Libpq support to connect to standby server as priority
On Wed, Jul 11, 2018 at 6:00 PM Haribabu Kommi wrote:
> On Wed, Jul 4, 2018 at 11:14 PM Laurenz Albe wrote:
>> Haribabu Kommi wrote:
>>
>> - I think the construction with "read_write_host_index" makes the code even
>> more complicated than it already is.
>>
>> What about keeping the first successful connection open and storing it in a
>> variable if we are in "prefer-read" mode? If we get the read-only connection
>> we desire, close that cached connection; otherwise use it.
>
> Even if we add a variable to cache the connection, I don't think the logic of
> checking the next host for a read-only server will change, but the extra
> connection request to the read-write host will be removed.

I evaluated your suggestion of caching the connection and reusing it when no
read-only server is found, but I think it will add more complexity, and it also
delays the connections to the other servers; the cached connection may also be
closed by the server because of a timeout. I feel the extra connection time may
be acceptable when the user asks for prefer-read mode, instead of adding more
complexity to handle the cached connection. Comments?

Regards,
Haribabu Kommi
Fujitsu Australia
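For context, assuming the proposed prefer-read behavior extends libpq's existing multi-host connection handling via the target_session_attrs parameter (an assumption; the thread discusses the mode by name only), client usage would look roughly like this: libpq tries the listed hosts, prefers one that reports itself as read-only, and falls back to a read-write server only if no standby accepts the connection.

    postgresql://host1:5432,host2:5432/mydb?target_session_attrs=prefer-read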
Re: patch to allow disable of WAL recycling
On 07/16/2018 04:54 AM, Stephen Frost wrote: Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: I think that the right basic idea is to have a GUC that chooses between the two implementations, but whether it can be set automatically is not clear to me. Can initdb perhaps investigate what kind of filesystem the WAL directory is sitting on, and set the default value from hard-wired knowledge about that? Maybe.. but I think we'd still need a way to change it because people often start with their database system minimally configured (including having WAL in the default location of the data directory) and only later realize that was a bad idea and change it later. I wouldn't be at all surprised if that "change it later" meant moving it to a different filesystem, and having to re-initdb to take advantage of that would be particularly unfriendly. I'm not sure the detection can be made sufficiently reliable for initdb. For example, it's not that uncommon to do initdb and then move the WAL to a different filesystem using symlink. Also, I wonder how placing the filesystem on LVM with snapshotting (which kinda makes it CoW) affects the system behavior. But maybe those are not issues, as long as the result is predictable. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel queries in single transaction
Hi, On 07/16/2018 09:45 AM, Paul Muntyanu wrote: Hello, I am working with data warehouse based on postgresql and would like to propose a feature. The idea is to give control and ability for developer to execute queries in parallel within single transaction. Usual flow is next: START_TRANSACTION -> QUERY1 -> QUERY2 -> QUERY3 -> END_TRANSACTION. However sometimes QUERY1 and QUERY2 are independent and can be executed in parallel mode. E.g.: START_TRANSACTION -> DEFINE_QUERY1(no execution) -> DEFINE_QUERY2(no_execution) -> EXECUTE_QUERY1_AND_QUERY2(in parallel) -> QUERY3 -> END Of course QUERY1 and QUERY2 can be dependent and then this would not work, but sometimes it is useful, especially when you have bound to e.g. CPU and query stuck. I'm struggling to understand what would be the possible benefits. Either the queries are CPU-bound or stuck (e.g. waiting for I/O), they can't be both at the same time. If a query is stuck, running it concurrently is pointless. If they are CPU-bound, we can run them in parallel (which should produce the results faster). I'd even dare to say that running the queries concurrently can easily hinder performance, because the queries will compete for parallel workers, preventing some of them from running in parallel mode. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel queries in single transaction
Hi Tomas, thanks for looking into this. I am talking more about queries which
cannot be optimized, e.g.:

* a full scan of one table and heavy calculations for another one;
* queries through FDWs (e.g. one query fetches data from Kafka and another one
  fetches from a remote Postgres; there are no bounds on either query other
  than local CPU, network and the remote machine).

Being I/O-bound is not a problem if you have multiple tablespaces. And being
CPU-bound need not be the case when you have 32 cores and 6 max workers per
query: during the nightly ETL I do not have anything except a single query
running, so only 6 cores are occupied. If I could run queries in parallel, I
would occupy two I/O stacks (two tablespaces) + 12 cores instead of
sequentially 6 and then again 6.

Hope that makes sense.
-P

On 16 Jul 2018, at 11:44, Tomas Vondra wrote:
> Hi,
>
> On 07/16/2018 09:45 AM, Paul Muntyanu wrote:
> > Hello, I am working with data warehouse based on postgresql and would like
> > to propose a feature. The idea is to give control and ability for developer
> > to execute queries in parallel within single transaction. Usual flow is
> > next: START_TRANSACTION -> QUERY1 -> QUERY2 -> QUERY3 -> END_TRANSACTION.
> > However sometimes QUERY1 and QUERY2 are independent and can be executed in
> > parallel mode. E.g.: START_TRANSACTION -> DEFINE_QUERY1(no execution) ->
> > DEFINE_QUERY2(no_execution) -> EXECUTE_QUERY1_AND_QUERY2(in parallel) ->
> > QUERY3 -> END Of course QUERY1 and QUERY2 can be dependent and then this
> > would not work, but sometimes it is useful, especially when you have bound
> > to e.g. CPU and query stuck.
>
> I'm struggling to understand what would be the possible benefits. Either the
> queries are CPU-bound or stuck (e.g. waiting for I/O), they can't be both at
> the same time. If a query is stuck, running it concurrently is pointless. If
> they are CPU-bound, we can run them in parallel (which should produce the
> results faster).
>
> I'd even dare to say that running the queries concurrently can easily hinder
> performance, because the queries will compete for parallel workers,
> preventing some of them from running in parallel mode.
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix some error handling for read() and errno
On Sat, Jul 14, 2018 at 03:37:56PM +0900, Michael Paquier wrote:
> For now, I think that just moving forward with 0001, and then revisiting
> 0002 once the other 2PC patch is settled, makes the most sense. On the
> other thread, the current 2PC behavior can create silent data loss, so
> I would like to back-patch it, so that would be less work.

Are there any objections to this plan? If none, then I would like to move on
with 0001, as there is clearly a consensus to simplify the work of translators
and to clean up the error code paths for read() calls. Let's sort out the rest
after the 2PC code paths are addressed.
--
Michael
Re: [WIP PATCH] Index scan offset optimisation using visibility map
Hello.

Thanks a lot for your feedback. I'll try to update the patch in a few days
(I'm currently stuck on a small performance regression in an unknown place).

Regarding the issue with DELETE: yes, it is a valid point, but removing a
record should clear the visibility map bit, so the tuple will be fetched from
the heap to test its existence. In that case the expressions are not evaluated
at all. I'm not sure about a DELETE and a query in the same transaction - I'll
check. Also, I need to recheck possible issues with EvalPlanQual.

PS. Updated link in case someone wants to briefly see the code until a git
patch is ready:
https://github.com/michail-nikolaev/postgres/compare/e3eb8be77ef82ccc8f87c515f96d01bf7c726ca8...michail-nikolaev:index_only_fetch?ts=4

On Sat, 14 Jul 2018 at 0:17, Heikki Linnakangas wrote:
> On 21/05/18 18:43, Michail Nikolaev wrote:
> > Hello everyone.
> > This letter relates to "Extended support for index-only-scan" from my
> > previous message in the thread.
> >
> > The WIP version of the patch has been ready for a while now and I think it
> > is time to resume the work on the feature. BTW, I found a small article
> > about Oracle vs Postgres focusing on this issue -
> > https://blog.dbi-services.com/postgres-vs-oracle-access-paths-viii/
> >
> > The current WIP version of the patch is located here -
> > https://github.com/postgres/postgres/compare/88ba0ae2aa4aaba8ea0d85c0ff81cc46912d9308...michail-nikolaev:index_only_fetch,
> > passing all checks. In addition, the patch includes a small optimization
> > for caching of amcostestimate results.
>
> Please submit an actual patch, extracted e.g. with "git format-patch -n1",
> rather than a link to an external site. That is a requirement for archival
> purposes, so that people reading the email archives later on can see what
> was being discussed. (And that link doesn't return a proper diff, anyway.)
>
> > For now, I decided to name the plan "Index Only Fetch Scan". Therefore:
> > * In case of "Index Scan" - we touch the index and heap for EVERY tuple we
> >   need to test
> > * For "Index Only Scan" - we touch the index for every tuple and NEVER
> >   touch the heap
> > * For "Index Only Fetch Scan" - we touch the index for every tuple and
> >   touch the heap only for those tuples we need to RETURN
>
> Hmm. IIRC there was some discussion on doing that, when index-only scans
> were implemented. It's not generally OK to evaluate expressions based on
> data that has already been deleted from the table. As an example of the
> kind of problems you might get:
>
> Imagine that a user does "DELETE FROM table WHERE div = 0; SELECT * FROM
> table WHERE 100 / div < 10". Would you expect the query to throw a
> "division by zero" error? If there was an index on 'div', you might
> evaluate the "100 / div" expression based on values from the index,
> which still includes entries for the just-deleted tuples with div = 0.
> They would be filtered out later, after performing the visibility
> checks, but it's too late if you already threw an error.
>
> Now, maybe there's some way around that, but I don't know what. Without
> some kind of a solution, this won't work.
>
> - Heikki
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On Sat, Jul 14, 2018 at 2:41 AM, Alvaro Herrera wrote: > On 2018-Jul-13, Ashutosh Bapat wrote: > >> On Fri, Jul 13, 2018 at 1:15 PM, Amit Langote >> wrote: >> >> >> >> I don't think this is true. When equality conditions and IS NULL clauses >> >> cover >> >> all partition keys of a hash partitioned table and do not have >> >> contradictory >> >> clauses, we should be able to find the partition which will remain >> >> unpruned. >> > >> > I was trying to say the same thing, but maybe the comment doesn't like it. >> > How about this: >> > >> > + * For hash partitioning, if we found IS NULL clauses for some keys >> > and >> > + * OpExpr's for others, gen_prune_steps_from_opexps() will generate >> > the >> > + * necessary pruning steps if all partition keys are taken care of. >> > + * If we found only IS NULL clauses, then we can generate the pruning >> > + * step here but only if all partition keys are covered. >> > >> >> It's still confusing a bit. I think "taken care of" doesn't convey the >> same meaning as "covered". I don't think the second sentence is >> necessary, it talks about one special case of the first sentence. But >> this is better than the prior version. > > Hmm, let me reword this comment completely. How about the attached? > > I also changed the order of the tests, putting the 'generate_opsteps' > one in the middle instead of at the end (so the not-null one doesn't > test that boolean again). AFAICS it's logically the same, and it makes > more sense to me that way. That looks much better. However it took me a small while to understand that (1), (2) and (3) correspond to strategies. > > I also changed the tests so that they apply to the existing mc2p table, > which is identical to the one Amit was adding. +1 for that. > >> > How about if we explicitly spell out the strategies like this: >> > >> > +if (!bms_is_empty(nullkeys) && >> > +(part_scheme->strategy == PARTITION_STRATEGY_LIST || >> > + part_scheme->strategy == PARTITION_STRATEGY_RANGE || >> > + (part_scheme->strategy == PARTITION_STRATEGY_HASH && >> > + bms_num_members(nullkeys) == part_scheme->partnatts))) >> >> I still do not understand why do we need (part_scheme->strategy == >> PARTITION_STRATEGY_HASH && bms_num_members(nullkeys) == >> part_scheme->partnatts) there. I thought that this will be handled by >> code, which deals with null keys and op_exprs together, somewhere >> else. > > I'm not sure I understand this question. This strategy applies to hash > partitioning only if we have null tests for all keys, so there are no > op_exprs. If there are any, there's no point in checking them. With the rearranged code, it's much more simple to understand this code. No change needed. >> >> Hmm. Ok. May be it's better to explicitly say that it's only useful in >> case of list partitioned table. > > Yeah, maybe. No need with the re-written code. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
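As a concrete illustration of the case under discussion (equality and IS NULL clauses together covering all keys of a hash-partitioned table), here is a hedged example; the table and partition names are made up rather than taken from the regression tests being edited, and the expectation is that, with the fix, pruning can pick exactly one partition:

    CREATE TABLE mchash (a int, b text) PARTITION BY HASH (a, b);
    CREATE TABLE mchash_p0 PARTITION OF mchash FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE mchash_p1 PARTITION OF mchash FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE mchash_p2 PARTITION OF mchash FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE mchash_p3 PARTITION OF mchash FOR VALUES WITH (MODULUS 4, REMAINDER 3);

    -- one key constrained by equality, the other by IS NULL: all partition
    -- keys are covered, so only the single matching partition should be scanned
    EXPLAIN (COSTS OFF) SELECT * FROM mchash WHERE a = 1 AND b IS NULL;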
Re: PATCH: psql tab completion for SELECT
(trimmed CC list to evade gmail's spam filter, sorry)

On 21/03/18 08:51, Edmund Horner wrote:
> Hi all,
>
> I haven't heard anything for a while and so assume you're beavering away on
> real features. :)
>
> I've been dogfooding this patch at work, and I am personally pretty happy
> with it. I still think the number of completions on an empty string is a bit
> too big, but I don't know what to omit. There are around 1700 completions on
> the empty "postgres" database in my testing, and we show the first 1000
> (arbitrarily limited, as the generated SQL has LIMIT 1000 but no ORDER BY).
> Should we just leave it as is?
>
> Whether we improve the filtering or not, I'm optimistic the feature will be
> committed in this CF or the next. I've also been working on adding support
> for completions after commas, but I really want to see the current feature
> finished first.

Playing around with this a little bit, I'm not very satisfied with the
completions. Sometimes this completes too much, and sometimes too little. All
of this has been mentioned in this and the other thread [1] already; this is
just my opinion on all this.

Too much:

postgres=# \d lp
              Table "public.lp"
  Column  |  Type   | Collation | Nullable | Default
----------+---------+-----------+----------+---------
 id       | integer |           |          |
 partkey  | text    |           |          |
 partcol1 | text    |           |          |
 partcol2 | text    |           |          |
Partition key: LIST (partkey)
Number of partitions: 1000 (Use \d+ to list them.)

postgres=# select pa[TAB]
pad_attribute      parameter_default  parameter_style    partattrs      partcol2       partexprs   partrelid
page               parameter_mode     parameter_types    partclass      partcollation  partkey     partstrat
pageno             parameter_name     parse_ident(       partcol1       partdefid      partnatts   passwd

Too little:

postgres=# select partkey, p[TAB]
[no completions]

Then there's the multi-column case, which seems weird (to a user - I know the
implementation and understand why):

postgres=# select partkey, partc[TAB]
[no completions]

And I'd love this case, where you go back to edit the SELECT list after already
typing the FROM part, to be smarter:

postgres=# select p[TAB] FROM lp;
Display all 370 possibilities? (y or n)

There's something weird going on with system columns, from a user's point of
view:

SELECT oi[TAB]
oid             oidvectortypes(

postgres=# select xm[TAB]
xmin                          xmlcomment(                   xml_is_well_formed_content(  xmlvalidate(
xmlagg(                       xml_is_well_formed(           xml_is_well_formed_document(

So oid and xmin are completed, but "xmax" and "ctid" are not. I think this is
because in fact none of the system columns are completed, but there happen to
be some tables with regular columns named "oid" and "xmin". So it makes sense
once you know that, but it's pretty confusing to a user. Tab-completion is a
great way for a user to explore what's available, so it's weird that some
system columns are effectively completed while others are not.

I'd like to not include set-returning functions in the list. Although you can
do "SELECT generate_series(1,10)", I'd like to discourage people from doing
that, since using set-returning functions in the target list is a PostgreSQL
extension to the SQL standard, and IMHO the "SELECT * FROM
generate_series(1,10)" syntax is easier to understand and works more sanely
with joins etc. Conversely, it would be really nice to include set-returning
functions in the completions after FROM.

I understand that there isn't much you can do about some of those things,
short of adding a ton of new context information about previous queries and
heuristics. I think the completion needs to be smarter to be acceptable. I
don't know exactly what to do, but perhaps someone else has ideas.
I'm also worried about performance, especially of the query to get all the
column names. It's pretty much guaranteed to perform a SeqScan + Sort + Unique
on pg_attribute, and pg_attribute can be very large. In my little test
database, with the above 1000-partition table, hitting tab after "SELECT "
takes about 1 second until you get the "Display all 1000 possibilities"
prompt. And that's not a particularly large schema.

- Heikki

PS. All the references to "pg_attribute" and other system tables, and
functions, need to be schema-qualified, as "pg_catalog.pg_attribute" and so
forth.

[1] https://www.postgresql.org/message-id/1328820579.11241.4.ca...@vanquo.pezone.net
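To make the performance concern concrete, the column-name completion has to run something shaped roughly like the query below against pg_attribute; this is an illustrative sketch, not the exact SQL the patch generates. Without a suitable index path it degenerates into the SeqScan + Sort/Unique described above.

    SELECT DISTINCT a.attname
      FROM pg_catalog.pg_attribute a
     WHERE a.attnum > 0
       AND NOT a.attisdropped
       AND a.attname LIKE 'par%'      -- text typed so far
     LIMIT 1000;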
Re: [HACKERS] PoC plpgsql - possibility to force custom or generic plan
On 12.07.18 11:12, Pavel Stehule wrote: > This appears to be the patch of record in this thread. I think there is > general desire for adding a setting like this, and the implementation is > simple enough. > > One change perhaps: How about naming the default setting "auto" instead > of "default". That makes it clearer what it does. > > > I changed "default" to "auto" > > updated patch attached committed with some editing -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
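For readers who want to try the committed feature: assuming the setting discussed here is the plan_cache_mode GUC with values auto, force_generic_plan and force_custom_plan (an assumption, since the thread does not name the GUC), usage looks like this:

    -- per-session override; "auto" restores the default heuristic
    SET plan_cache_mode = force_custom_plan;
    PREPARE q(int) AS SELECT count(*) FROM pg_class WHERE relpages > $1;
    EXECUTE q(10);   -- always planned with the actual parameter value
    RESET plan_cache_mode;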
Re: Finding database for pg_upgrade missing library
On Sat, Jul 14, 2018 at 12:14:51AM +0200, Daniel Gustafsson wrote: > > On 13 Jul 2018, at 18:28, Bruce Momjian wrote: > > > > I received a private pg_upgrade feature request to report the database > > name for missing loadable libraries. Currently we report "could not > > load library" and the library file name, e.g. $libdir/pgpool-regclass. > > > > The request is that we report the _database_ name that contained the > > loadable library reference. However, that isn't easy to do because we > > gather all loadable library file names, sort them, and remove > > duplicates, for reasons of efficiency and so we check libraries in a > > predictable alphabetical order. > > > > Is it worth modifying pg_upgrade to report the first or all databases > > that contain references to missing loadable libraries? I don't think > > so, but I wanted to ask here. > > I ran into that very problem when upgrading 50+ clusters during a long night a > while back, and patched pg_upgrade to report the database which helped a lot > (or at least helped me be a bit lazy). So, +1 on the feature from me. If > there is interest I can see if I can dig out the patch and polish it up. Yes, please post the patch. Seems we now have three people who want this. Even though it is related to reporting errors, I think this is a new feature so will only be in PG 12. Looking at the code, we do a qsort(), then detect (since they are all now adjacent) and remove the duplicate references to the library. What I think should be done is to move the duplicate detection down to the place where we check for the library, then print all the database names of the duplicates if we don't find the library. I guess we either need a separate array for the database name, or a 'struct' that holds the library name and database name. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
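A minimal sketch of the second option Bruce mentions (a struct holding both names); the type and field names are hypothetical rather than anything from pg_upgrade's actual code:

    /* one entry per library reference found in a database */
    typedef struct LibraryInfo
    {
        char   *libname;    /* e.g. "$libdir/pgpool-regclass" */
        char   *dbname;     /* database the reference was found in */
    } LibraryInfo;

With an array of these, qsort() can order entries by library name (with database name as a tie-breaker), duplicate detection can move down to check time, and a missing library can then report every database that references it.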
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 07/16/2018 12:16 AM, Tomas Vondra wrote:
> On 07/15/2018 04:43 PM, Dean Rasheed wrote:
>> On 15 July 2018 at 14:29, Tomas Vondra wrote:
>>> It's quite unclear to me how this algorithm could reliably end up with
>>> hist_sel=0 (in cases where we already don't end up with that). I mean, if a
>>> bucket matches the conditions, then the only way to eliminate it is by
>>> deducing that the MCV already contains all the matches - and that's rather
>>> difficult for complex clauses ...
>>
>> Ah, I didn't realise that you were using histograms for equality clauses as
>> well. I had assumed that they would only use the MCV stats, as in the
>> univariate case. Using histograms for equality seems problematic -- if
>> bucket_contains_value() returns STATS_MATCH_PARTIAL, as things stand that
>> would end up with an estimate of half the bucket's frequency, which seems
>> excessive.
>
> Yeah, I think you're right - this is likely to produce over-estimates and
> needs rethinking. With top-level equality clauses it should be fairly trivial
> to use an approach similar to the univariate case, i.e. computing ndistinct
> and using
>
>     (1 - mcv_totalsel) / ndistinct
>
> If there are ndistinct coefficients this might be a pretty good estimate of
> the non-MCV part, I think. But it only works for top-level equalities, not
> for complex clauses as in your examples.

On further thought, it's a bit more complicated, actually. Firstly, we already
do that when there's no histogram (as in your example), and clearly it does not
help. I initially thought it's a mistake to use the histogram in this case, but
I can think of cases where it helps a lot.

1) when the equality clauses match nothing

In this case we may not find any buckets possibly matching the combination of
values, producing selectivity estimate 0.0, while using 1/ndistinct would give
us something else.

2) when there are equality and inequality clauses

Similarly to the previous case, the equality clauses are useful in eliminating
some of the buckets.

Now, I agree estimating equality clauses using the histogram is tricky, so
perhaps what we should do is use them as "conditions" to eliminate histogram
buckets, but use ndistinct to estimate the selectivity. That is, something like
this:

    P(a=1 & b=1 & c<10 & d>=100)
        = P(a=1 & b=1) * P(c<10 & d>=100 | a=1 & b=1)
        = 1/ndistinct(a,b) * P(c<10 & d>=100 | a=1 & b=1)

where the second part is estimated using the histogram. Of course, this still
only works for the top-level equality clauses :-(

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 07/15/2018 11:36 AM, Dean Rasheed wrote:
> On 13 July 2018 at 18:27, Tomas Vondra wrote:
>> I'm not so sure. The issue is that a lot of the MCV deductions depend on
>> whether we can answer questions like "Is there a single match?" or "If we
>> got a match in the MCV, do we need to look at the non-MCV part?" This is not
>> very different from the single-column estimates, except of course here we
>> need to look at multiple columns.
>>
>> The top-level clauses allow us to make such deductions; with deeper clauses
>> it's much more difficult (perhaps impossible). Because for example with
>> (a=1 AND b=1) there can be just a single match, so if we find it in the MCV
>> we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR b=2)) it's not that
>> simple, because there may be multiple combinations and so a match in the MCV
>> does not guarantee anything.
>
> Actually, it guarantees a lower bound on the overall selectivity, and maybe
> that's the best that we can do in the absence of any other stats.

Hmm, is that actually true? Let's consider a simple example, with two columns,
each with just 2 values, and a "perfect" MCV list:

    a | b | frequency
   ---+---+-----------
    1 | 1 | 0.5
    2 | 2 | 0.5

And let's estimate sel(a=1 & b=2). Your proposed algorithm does this:

1) sel(a=1) = 0.5
2) sel(b=2) = 0.5
3) total_sel = sel(a=1) * sel(b=2) = 0.25
4) mcv_sel = 0.0
5) total_sel = Max(total_sel, mcv_sel) = 0.25

How is that a lower bound? Or what is it lower than?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel queries in single transaction
On 07/16/2018 12:03 PM, Paul Muntyanu wrote: Hi Tomas, thanks for looking into. I am more talking about queries which can not be optimized, e.g. * fullscan of the table and heavy calculations for another one. * query through FDW for both queries(e.g. one query fetches data from Kafka and another one is fetching from remote Postgres. There are no bounds for both queries for anything except local CPU, network and remote machine) IO bound is not a problem in case if you have multiple tablesapces. But it was you who mentioned "query stuck" not me. I merely pointed out that in such cases running queries concurrently won't help. And CPU bound can be not the case when you have 32 cores and 6 max workers per query. Then, during nigtly ETL, I do not have anything except single query running) == 6 cores are occupied. If I can run queries in parallel, I would occupy two IO stacks(two tablespaces) + 12 cores instead of sequentially 6 and then again 6. Well, sure. But you could just as well open multiple connections and make the queries concurrent that way. Or change the GUC to increase the number of workers for the nightly ETL. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Finding database for pg_upgrade missing library
On Fri, Jul 13, 2018 at 10:15:34PM -0500, Justin T Pryzby wrote: > On Fri, Jul 13, 2018 at 12:28:15PM -0400, Bruce Momjian wrote: > > I received a private pg_upgrade feature request to report the database > > name for missing loadable libraries. Currently we report "could not > > load library" and the library file name, e.g. $libdir/pgpool-regclass. > > > > The request is that we report the _database_ name that contained the > > loadable library reference. However, that isn't easy to do because we > > gather all loadable library file names, sort them, and remove > > duplicates, for reasons of efficiency and so we check libraries in a > > predictable alphabetical order. > > > > Is it worth modifying pg_upgrade to report the first or all databases > > that contain references to missing loadable libraries? I don't think > > so, but I wanted to ask here. > > Yes please, with a preference for the "all databases" option. > > We typically have only 4 DBs, including postgres and template1,2. It's > annoying enough when an upgrade process breaks because pg_repack or > pg_stat_buffercache installed into postgres DB. But it's a veritable pain > when > you discover in the middle of an upgrade that postgis had been somehow loaded > into template1, needs to be uninstalled (or upgraded from 22 to 23 to allow > upgrade), old postgis package was already removed.. Maybe you find that one > library was installed one place, fix it and restart the upgrade process. Then > it fails because the old library was also installed some other place.. > > When I've had to figure this out in the past, I ended up grepping the dumps to > figure out what old library was where. OK, we now have three people who want this so we will get it into PG 12. Thanks. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 16 July 2018 at 13:23, Tomas Vondra wrote:
>>> The top-level clauses allow us to make such deductions, with deeper
>>> clauses it's much more difficult (perhaps impossible). Because for
>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>> b=2)) it's not that simple, because there may be multiple combinations
>>> and so a match in MCV does not guarantee anything.
>>
>> Actually, it guarantees a lower bound on the overall selectivity, and
>> maybe that's the best that we can do in the absence of any other
>> stats.
>
> Hmm, is that actually true? Let's consider a simple example, with two
> columns, each with just 2 values, and a "perfect" MCV list:
>
>     a | b | frequency
>    ---+---+-----------
>     1 | 1 | 0.5
>     2 | 2 | 0.5
>
> And let's estimate sel(a=1 & b=2).

OK. In this case, there are no MCV matches, so there is no lower bound (it's
0). What we could do though is also impose an upper bound, based on the sum of
non-matching MCV frequencies. In this case, the upper bound is also 0, so we
could actually say the resulting selectivity is 0.

Regards,
Dean
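Spelling the two bounds out for this example, in the thread's own notation:

    lower_bound = sum of MCV frequencies matching (a=1 & b=2)        = 0
    upper_bound = 1 - sum of MCV frequencies that cannot match
                = 1 - (0.5 + 0.5)                                    = 0

so the estimate can be clamped to 0, overriding the independence-based
total_sel = 0.25 computed above.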
Re: cursors with prepared statements
On 11.07.18 19:07, Heikki Linnakangas wrote:
>> One point worth pondering is how to pass the parameters of the prepared
>> statements. The actual SQL standard syntax would be
>>
>>     DECLARE cursor_name CURSOR FOR prepared_statement_name;
>>     OPEN cursor_name USING param, param;
>>
>> But since we don't have the OPEN statement in direct SQL, it made sense
>> to me to attach the USING clause directly to the DECLARE statement.
>
> Hmm. I'm not excited about adding PostgreSQL-extensions to the SQL
> standard.

Isn't that what we do all the time?

> It's confusing, and risks conflicting with future additions to
> the standard. ECPG supports the actual standard syntax, with OPEN,
> right? So this wouldn't be consistent with ECPG, either.

It would be consistent for the case of no parameters.

>> Curiously, the direct EXECUTE statement uses the non-standard syntax
>>
>>     EXECUTE prep_stmt (param, param);
>>
>> instead of the standard
>>
>>     EXECUTE prep_stmt USING param, param;
>>
>> I tried to consolidate this. But using
>>
>>     DECLARE c CURSOR FOR p (foo, bar)
>>
>> leads to parsing conflicts (and looks confusing?),
>
> How about
>
>     DECLARE c CURSOR FOR EXECUTE p (foo, bar)

That's not the standard syntax for the case of no parameters.

> The attached patch seems to do the trick, of allowing EXECUTE + USING.
> I'm not sure this is worth the trouble, though, since EXECUTE as a plain
> SQL command is a PostgreSQL-extension anyway.

I think it's a PostgreSQL extension that we allow just about anything to be
executed directly. So we should still use the standard syntax either way. It
would be weird if EXECUTE or any other command had different syntax in direct
SQL, ECPG, PL/pgSQL, etc. We have some differences already, but we shouldn't
create more.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
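Putting the two variants being debated side by side, in direct SQL; the USING clause on DECLARE is the proposed PostgreSQL extension under discussion, not something that runs today, and OPEN is not available in direct SQL at all:

    PREPARE stmt(int, text) AS
        SELECT * FROM pg_class WHERE relpages > $1 AND relname LIKE $2;

    -- proposed form: attach USING directly to DECLARE
    BEGIN;
    DECLARE c CURSOR FOR stmt USING 10, 'pg_%';
    FETCH 5 FROM c;
    COMMIT;

    -- what the SQL standard actually specifies:
    --   DECLARE c CURSOR FOR stmt;
    --   OPEN c USING 10, 'pg_%';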
Re: Usage of epoch in txid_current
Hi, On 2018-07-15 16:41:35 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-07-09 19:56:25 -0400, Tom Lane wrote: > >> Or, perhaps, use a struct in assert builds and int64 otherwise? > >> You could hide the ensuing notational differences in macros. > > > That should be doable. But I'd like to check if it's necessary > > first. Optimizing passing an 8 byte struct shouldn't be hard for > > compilers these days - especially when most things dealing with them are > > inline functions. If we find that it's not a problem on contemporary > > compilers, it might be worthwhile to use a bit more type safety in other > > places too. > > [ bunch of test results ] > Offhand it would seem that we can get away with struct wrappers > on any platform where performance is really of concern today. Cool, thanks for checking! Greetings, Andres Freund
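A minimal sketch of the struct-wrapper approach being benchmarked here, with illustrative names only; the point is that the wrapper adds type safety without changing the representation, and with inline accessors the tested compilers pass it around like a plain 64-bit integer:

    #include <stdint.h>

    typedef struct FullTransactionId
    {
        uint64_t    value;      /* epoch in the high 32 bits, xid in the low */
    } FullTransactionId;

    static inline uint32_t
    XidFromFullTransactionId(FullTransactionId fxid)
    {
        return (uint32_t) fxid.value;
    }

    static inline uint32_t
    EpochFromFullTransactionId(FullTransactionId fxid)
    {
        return (uint32_t) (fxid.value >> 32);
    }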
Re: patch to allow disable of WAL recycling
On 2018-07-15 20:32:39 -0400, Robert Haas wrote: > On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund wrote: > > This is formulated *WAY* too positive. It'll have dramatic *NEGATIVE* > > performance impact of non COW filesystems, and very likely even negative > > impacts in a number of COWed scenarios (when there's enough memory to > > keep all WAL files in memory). > > > > I still think that fixing this another way would be preferrable. This'll > > be too much of a magic knob that depends on the fs, hardware and > > workload. > > I tend to agree with you, but unless we have a pretty good idea what > that other way would be, I think we should probably accept the patch. I don't think I've argued against that - I just want there to be sufficient caveats to make clear it's going to hurt on very common OS & FS combinations. > I think part of the problem here is that whether a WAL segment is > likely to be cached depends on a host of factors which we don't track > very carefully, like whether it's been streamed or decoded recently. > If we knew when that a particular WAL segment hadn't been accessed for > any purpose in 10+ minutes, it would probably be fairly safe to guess > that it's no longer in cache; if we knew that it had been accessed <15 > seconds ago, that it is probably still in cache. But we have no idea. True. Additionally we don't know whether, even if cold cache, re-initializing files isn't worse performance-wise than recycling files. Greetings, Andres Freund
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Hi, On 2018-07-15 18:48:43 -0400, Tom Lane wrote: > So basically, WAL replay hits an error while holding a buffer pin, and > nothing is done to release the buffer pin, but AtProcExit_Buffers thinks > something should have been done. I think there's a few other cases where we hit this. I've seen something similar from inside checkpointer / BufferSync(). I'd be surprised if bgwriter couldn't be triggered into the same. > What seems like a better answer for now is to adjust AtProcExit_Buffers > so that it doesn't cause an assertion failure in this case. I think we > can define "this case" as being "failure exit from the startup process": > if that happens, the postmaster is going to throw up its hands and force > a panic shutdown anyway, so the failure to free a buffer pin shouldn't > have any serious consequences. > The attached one-liner patch does that, and is enough to get through > Michael's test case without an assertion. This is just proof of concept > though --- my inclination, if we go this route, is to make a slightly > longer patch that would fix CheckForBufferLeaks to still print the WARNING > messages if any, but not die with an assertion afterwards. > > Another question is whether this is really a sufficient band-aid. It > looks to me like AtProcExit_Buffers will be called in any auxiliary > process type, not only the startup process. We could just replace the Assert() with a PANIC? > Do, or should we, force a panic restart for nonzero-exit-code failures > of all aux process types? If not, what are we going to do to clean up > similar failures in other aux process types? I'm pretty sure that we do *not* force a panic on all nonzero-exit-code cases for other subprocesses. > BTW, this assertion has been there since fe548629; before that, there > was code that would release any leaked buffer pins, relying on the > PrivateRefCount data for that. So another idea is to restore some > version of that capability, although I think we really really don't > wanna scan the PrivateRefCount array if we can avoid it. I don't think scanning PrivateRefCount would be particularly problematic - in almost all cases it's going to be tiny. These day it's not NBuffers sized anymore, so I can't forsee any performance problems? I think we could invent something like like enter/leave "transactional mode" wherein we throw a PANIC when a buffer leaked. Processes that don't enter it - like startup, checkpointer, etc - would just WARN? Greetings, Andres Freund
Re: patch to allow disable of WAL recycling
Hi, On 2018-07-15 20:55:38 -0400, Tom Lane wrote: > That's not the way to think about it. On a COW file system, we don't > want to "create 16MB files" at all --- we should just fill WAL files > on-the-fly, because the pre-fill activity isn't actually serving the > intended purpose of reserving disk space. It's just completely useless > overhead :-(. So we can't really make a direct comparison between the > two approaches; there's no good way to net out the cost of constructing > the WAL data we need to write. We probably should still allocate them in 16MB segments. We rely on the size being fixed in a number of places. But it's probably worthwhile to just do a posix_fadvise or such. Also, if we continually increase the size with each write we end up doing a lot more metadata transactions, which'll essentially serve to increase journalling overhead further. Greetings, Andres Freund
Re: Pluggable Storage - Andres's take
Hi, I've pushed up a new version to https://github.com/anarazel/postgres-pluggable-storage which now passes all the tests. Besides a lot of bugfixes, I've rebased the tree, moved TriggerData to be primarily slot based (with a conversion roundtrip when calling trigger functions), and a lot of other small things. On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote: > On Tue, Jul 3, 2018 at 5:06 PM Andres Freund wrote: > > The current state of my version of the patch is *NOT* ready for proper > > review (it doesn't even pass all tests, there's FIXME / elog()s). But I > > think it's getting close enough to it's eventual shape that more eyes, > > and potentially more hands on keyboards, can be useful. > > > > I will try to update it to make sure that it passes all the tests and also > try to > reduce the FIXME's. Cool. Alexander, Haribabu, if you give me (privately) your github accounts, I'll give you write access to that repository. > > The most fundamental issues I had with Haribabu's last version from [2] > > are the following: > > > > - The use of TableTuple, a typedef from void *, is bad from multiple > > fronts. For one it reduces just about all type safety. There were > > numerous bugs in the patch where things were just cast from HeapTuple > > to TableTuple to HeapTuple (and even to TupleTableSlot). I think it's > > a really, really bad idea to introduce a vague type like this for > > development purposes alone, it makes it way too hard to refactor - > > essentially throwing the biggest benefit of type safe languages out of > > the window. > > > > My earlier intention was to remove the HeapTuple usage entirely and replace > it with slot everywhere outside the tableam. But it ended up with TableTuple > before it reach to the stage because of heavy use of HeapTuple. I don't think that's necessary - a lot of the system catalog accesses are going to continue to be heap tuple accesses. And the conversions you did significantly continue to access TableTuples as heap tuples - it was just that the compiler didn't warn about it anymore. A prime example of that is the way the rewriteheap / cluster integreation was done. Cluster continued to sort tuples as heap tuples - even though that's likely incompatible with other tuple formats which need different state. > > Additionally I think it's also the wrong approach architecturally. We > > shouldn't assume that a tuple can efficiently be represented as a > > single palloc'ed chunk. In fact, we should move *away* from relying on > > that so much. > > > > I've thus removed the TableTuple type entirely. > > > > Thanks for the changes, I didn't check the code yet, so for now whenever the > HeapTuple is required it will be generated from slot? Pretty much. > > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to > > a higher memory usage, because the resulting tuples weren't freed or > > anything. There might be a reason for doing such a change - we've > > certainly discussed that before - but I'm *vehemently* against doing > > that at the same time we introduce pluggable storage. Analyzing the > > performance effects will be hard enough without changes like this. > > > > How about using of slot instead of tuple and reusing of it? I don't know > yet whether it is possible everywhere. Not quite sure what you mean? > > Tasks / Questions: > > > > - split up patch > > > > How about generating refactoring changes as patches first based on > the code in your repo as discussed here[1]? 
Sure - it was just at the moment too much work ;) > > - Change heap table AM to not allocate handler function for each table, > > instead allocate it statically. Avoids a significant amount of data > > duplication, and allows for a few more compiler optimizations. > > > > Some kind of static variable handlers for each tableam, but need to check > how can we access that static handler from the relation. I'm not sure what you mean by "how can we access"? We can just return a pointer from the constant data from the current handler? Except that adding a bunch of consts would be good, the external interface wouldn't need to change? > > - change scan level slot creation to use tableam function for doing so > > - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid > > > > so with this there shouldn't be a way from slot to tid mapping or there > should be some other way. I'm not sure I follow? > - bitmap index scans probably need a new tableam.h callback, abstracting > > bitgetpage() > > > > OK. Any chance you could try to tackle this? I'm going to be mostly out this week, so we'd probably not run across each others feet... Greetings, Andres Freund
Re: Pluggable Storage - Andres's take
Hi, On 2018-07-05 15:25:25 +0300, Alexander Korotkov wrote: > > - I think the move of the indexing from outside the table layer into the > > storage layer isn't a good idea. It lead to having to pass EState into > > the tableam, a callback API to perform index updates, etc. This seems > > to have at least partially been triggered by the speculative insertion > > codepaths. I've reverted this part of the changes. The speculative > > insertion / confirm codepaths are now exposed to tableam.h - I think > > that's the right thing because we'll likely want to have that > > functionality across more than a single tuple in the future. > > I agree that passing EState into tableam doesn't look good. But I > believe that tableam needs way more control over indexes than it has > in your version patch. If even tableam-independent insertion into > indexes on tuple insert is more or less ok, but on update we need > something smarter rather than just insert index tuples depending > "update_indexes" flag. Tableam specific update method may decide to > update only some of indexes. For example, when zheap performs update > in-place then it inserts to only indexes, whose fields were updated. > And I think any undo-log based storage would have similar behavior. > Moreover, it might be required to do something with existing index > tuples (for instance, as I know, zheap sets "deleted" flag to index > tuples related to previous values of updated fields). I agree that we probably need more - I'm just inclined to think that we need a more concrete target to work against. Currently zheap's indexing logic still is fairly naive, I don't think we'll get the interface right without having worked further on the zheap side of things. > > - The visibility functions relied on the *caller* performing buffer > > locking. That's not a great idea, because generic code shouldn't know > > about the locking scheme a particular AM needs. I've changed the > > external visibility functions to instead take a slot, and perform the > > necessary locking inside. > > Makes sense for me. But would it cause extra locking/unlocking and in > turn performance impact? I don't think so - nearly all the performance relevant cases do all the visibility logic inside the AM. Where the underlying functions can be used, that do not do the locking. Pretty all the converted places just had manual LockBuffer calls. > > - HOT was encoded in the API in a bunch of places. That doesn't look > > right to me. I tried to improve a bit on that, but I'm not yet quite > > sure I like it. Needs written explanation & arguments... > > Yes, HOT is heapam specific feature. Other tableams might not have > HOT. But it appears that we still expose hot_search_buffer() function > in tableam API. But that function have no usage, so it's just > redundant and can be removed. Yea, that was a leftover. > > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to > > a higher memory usage, because the resulting tuples weren't freed or > > anything. There might be a reason for doing such a change - we've > > certainly discussed that before - but I'm *vehemently* against doing > > that at the same time we introduce pluggable storage. Analyzing the > > performance effects will be hard enough without changes like this. > > I think once we switched to slots, doing heap_copytuple() do > frequently is not required anymore. It's mostly gone now. > > - I've for now backed out the heap rewrite changes, partially. 
Mostly > > because I didn't like the way the abstraction looks, but haven't quite > > figured out how it should look like. > > Yeah, it's hard part, but we need to invent something in this area... I agree. But I really don't yet quite now what. I somewhat wonder if we should just add a cluster_rel() callback to the tableam and let it deal with everything :(. As previously proposed the interface wouldn't have worked with anything not losslessly encodable into a heaptuple, which is unlikely to be sufficient. FWIW, I plan to be mostly out until Thursday this week, and then I'll rebase onto the new version of the abstract slot patch and then try to split up the patchset. Once that's done, I'll do a prototype conversion of zheap, which I'm sure will show up a lot of weaknesses in the current abstraction. Once that's done I hope we can collaborate / divide & conquer to get the individual pieces into commit shape. If either of you wants to get a head start separating something out, let's try to organize who would do what? The EPQ and trigger slotification are probably good candidates. Greetings, Andres Freund
Re: Usage of epoch in txid_current
Andres Freund writes: > On 2018-07-15 16:41:35 -0400, Tom Lane wrote: >> Andres Freund writes: >>> On 2018-07-09 19:56:25 -0400, Tom Lane wrote: Or, perhaps, use a struct in assert builds and int64 otherwise? You could hide the ensuing notational differences in macros. >> [ bunch of test results ] >> Offhand it would seem that we can get away with struct wrappers >> on any platform where performance is really of concern today. > Cool, thanks for checking! BTW, independently of any performance questions, these results show that my idea above was untenable anyway. On those platforms where there is a codegen difference, doing it like that would have resulted in an ABI difference between regular and assert builds. That's something to avoid, at least in any API that's visible to extension modules --- we've had project policy for some time that it should be possible to use non-assert extensions with assert-enabled core and vice versa. Conceivably we could have used the struct API only under a special devel flag that few people use except a buildfarm animal or two. But that's just a pain in the rear. regards, tom lane
Re: Parallel queries in single transaction
> Well, sure. But you could just as well open multiple connections and
> make the queries concurrent that way. Or change the GUC to increase the
> number of workers for the nightly ETL.

That is an option right now, using permanent staging tables for a future join.
I mistakenly said ETL while it is ELT, which means that most of the operations
happen in the database, so we try to keep all changes in database code instead
of changing the execution engine. In PG11 we have parallel CTAS, which is a
dramatic improvement for us, but there will still be operations (query plans)
which are not parallel.

Having PostgreSQL completely ACID is an amazing feature, so having to run the
ELT operation outside a transaction - and guarantee that the ELT job completed
successfully by checking that all steps (multiple transactions with staging
tables) succeeded, with graceful rollback and cleanup in case of failure -
makes things more complex. Indeed, I still agree that it is possible to work
around this at the application level.

-P

On Mon, Jul 16, 2018 at 2:28 PM Tomas Vondra wrote:
> On 07/16/2018 12:03 PM, Paul Muntyanu wrote:
> > Hi Tomas, thanks for looking into. I am more talking about queries which
> > can not be optimized, e.g.
> > * fullscan of the table and heavy calculations for another one.
> > * query through FDW for both queries(e.g. one query fetches data from
> > Kafka and another one is fetching from remote Postgres. There are no
> > bounds for both queries for anything except local CPU, network and
> > remote machine)
> >
> > IO bound is not a problem in case if you have multiple tablesapces.
>
> But it was you who mentioned "query stuck" not me. I merely pointed out
> that in such cases running queries concurrently won't help.
>
> > And CPU bound can be not the case when you have 32 cores and 6 max workers
> > per query. Then, during nigtly ETL, I do not have anything except single
> > query running) == 6 cores are occupied. If I can run queries in
> > parallel, I would occupy two IO stacks(two tablespaces) + 12 cores
> > instead of sequentially 6 and then again 6.
>
> Well, sure. But you could just as well open multiple connections and
> make the queries concurrent that way. Or change the GUC to increase the
> number of workers for the nightly ETL.
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: patch to allow disable of WAL recycling
Andres Freund writes: > On 2018-07-15 20:55:38 -0400, Tom Lane wrote: >> That's not the way to think about it. On a COW file system, we don't >> want to "create 16MB files" at all --- we should just fill WAL files >> on-the-fly, because the pre-fill activity isn't actually serving the >> intended purpose of reserving disk space. It's just completely useless >> overhead :-(. So we can't really make a direct comparison between the >> two approaches; there's no good way to net out the cost of constructing >> the WAL data we need to write. > We probably should still allocate them in 16MB segments. We rely on the > size being fixed in a number of places. Reasonable point. I was supposing that it'd be okay if a partially written segment were shorter than 16MB, but you're right that that would require vetting a lot of code to be sure about it. > But it's probably worthwhile to > just do a posix_fadvise or such. Also, if we continually increase the > size with each write we end up doing a lot more metadata transactions, > which'll essentially serve to increase journalling overhead further. Hm. What you're claiming is that on these FSen, extending a file involves more/different metadata activity than allocating new space for a COW overwrite of an existing area within the file. Is that really true? The former case would be far more common in typical usage, and somehow I doubt the FS authors would have been too stupid to optimize things so that the same journal entry can record both the space allocation and the logical-EOF change. But anyway, this means we have two nearly independent issues to investigate: whether recycling/renaming old files is cheaper than constantly creating and deleting them, and whether to use physical file zeroing versus some "just set the EOF please" filesystem call when first creating a file. The former does seem like it's purely a performance question, but the latter involves a tradeoff of performance against an ENOSPC-panic protection feature that in reality only works on some filesystems. regards, tom lane
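To make the two approaches concrete, here is a rough standalone sketch (not the actual xlog.c code) of the difference under discussion: physically zero-filling a 16MB segment versus asking the filesystem to reserve the space via posix_fallocate(), which is cheap but only gives a real ENOSPC guarantee on filesystems that honor the reservation.

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)

/* Variant 1: physical zero-fill.  Writing 16MB of zeros really allocates the
 * blocks on a conventional filesystem, so later WAL writes cannot fail with
 * ENOSPC; on a COW filesystem the zeros are just wasted work. */
static int
prefill_with_zeros(int fd)
{
    char        buf[8192];

    memset(buf, 0, sizeof(buf));
    for (off_t off = 0; off < SEG_SIZE; off += (off_t) sizeof(buf))
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            return -1;
    }
    return fsync(fd);
}

/* Variant 2: "just set the EOF".  posix_fallocate() asks the filesystem to
 * reserve the space without writing data; whether that reservation means
 * anything on a COW filesystem is exactly the question raised above. */
static int
prefill_with_fallocate(int fd)
{
    return posix_fallocate(fd, 0, SEG_SIZE);
}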
Re: cursors with prepared statements
On Mon, Jul 16, 2018 at 8:56 AM, Peter Eisentraut wrote: >> The attached patch seems to do the trick, of allowing EXECUTE + USING. >> I'm not sure this is worth the trouble, though, since EXECUTE as a plain >> SQL command is a PostgreSQL-extension anyway. > > I think it's a PostgreSQL extension that we allow just about anything to > be executed directly. So we should still use the standard syntax either > way. It would be weird if EXECUTE or any other command had different > syntax in direct SQL, ECPG, PL/pgSQL, etc. We have some differences > already, but we shouldn't create more. Hmm. Your proposal to attach a USING clause to DECLARE .. CURSOR FOR rather than inventing an OPEN command is an argument for a PostgreSQL syntax extension, but your proposal to write DECLARE .. CURSOR FOR rather than DECLARE .. CURSOR FOR EXECUTE is an argument for standard syntax over and against a PostgreSQL extension. That sounds a little contradictory, but I think I agree with it. If we allow a USING clause for DECLARE .. CURSOR FOR, that doesn't prevent somebody from inventing an OPEN command in the future. As part of introducing such an OPEN command, DECLARE .. CURSOR FOR could be made not to fail if the prepared statement has parameters but no USING commands. On the other hand, if we insist on injecting the word EXECUTE into the syntax as Heikki proposes, that's purely and simply an incompatibility with the standard's syntax, as well as with what ECPG already does. So +1 for your position. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Vacuum: allow usage of more than 1GB of work mem
On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: > > > > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > > On 13/07/18 01:39, Andrew Dunstan wrote: > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > >>> On 2018-Jul-12, Andrew Dunstan wrote: > >>> > I fully understand. I think this needs to go back to "Waiting on > Author". > >>> Why? Heikki's patch applies fine and passes the regression tests. > >> > >> Well, I understood Claudio was going to do some more work (see > >> upthread). > > > > Claudio raised a good point, that doing small pallocs leads to > > fragmentation, and in particular, it might mean that we can't give > > back the memory to the OS. The default glibc malloc() implementation > > has a threshold of 4 or 32 MB or something like that - allocations > > larger than the threshold are mmap()'d, and can always be returned to > > the OS. I think a simple solution to that is to allocate larger > > chunks, something like 32-64 MB at a time, and carve out the > > allocations for the nodes from those chunks. That's pretty > > straightforward, because we don't need to worry about freeing the > > nodes in retail. Keep track of the current half-filled chunk, and > > allocate a new one when it fills up. > > > Google seems to suggest the default threshold is much lower, like 128K. > Still, making larger allocations seems sensible. Are you going to work > on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
Re: patch to allow disable of WAL recycling
On Mon, Jul 16, 2018 at 10:12 AM, Tom Lane wrote: > But anyway, this means we have two nearly independent issues to > investigate: whether recycling/renaming old files is cheaper than > constantly creating and deleting them, and whether to use physical > file zeroing versus some "just set the EOF please" filesystem call > when first creating a file. The former does seem like it's purely > a performance question, but the latter involves a tradeoff of > performance against an ENOSPC-panic protection feature that in > reality only works on some filesystems. It's been a few years since I tested this, but my recollection is that if you fill up pg_xlog, the system will PANIC and die on a vanilla Linux install. Sure, you can set max_wal_size, but that's a soft limit, not a hard limit, and if you generate WAL faster than the system can checkpoint, you can overrun that value and force allocation of additional WAL files. So I'm not sure we have any working ENOSPC-panic protection today. Given that, I'm doubtful that we should prioritize maintaining whatever partially-working protection we may have today over raw performance. If we want to fix ENOSPC on pg_wal = PANIC, and I think that would be a good thing to fix, then we should do it either by finding a way to make the WAL insertion ERROR out instead of panicking, or throttle WAL generation as we get close to disk space exhaustion so that the checkpoint has time to complete, as previously proposed by Heroku. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Andres Freund writes: > On 2018-07-15 18:48:43 -0400, Tom Lane wrote: >> So basically, WAL replay hits an error while holding a buffer pin, and >> nothing is done to release the buffer pin, but AtProcExit_Buffers thinks >> something should have been done. > I think there's a few other cases where we hit this. I've seen something > similar from inside checkpointer / BufferSync(). I'd be surprised if > bgwriter couldn't be triggered into the same. Hm, yeah, on reflection it's pretty obvious that those are hazard cases. > I'm pretty sure that we do *not* force a panic on all nonzero-exit-code > cases for other subprocesses. That's my recollection as well -- mostly, we just start a new one. So I said I didn't want to do extra work on this, but I am looking into fixing it by having these aux process types run a ResourceOwner that can be told to clean up any open buffer pins at exit. We could be sure the coverage is complete by dint of removing the special-case code in resowner.c that allows buffer pins to be taken with no active resowner. Then CheckForBufferLeaks can be left as-is, ie something we do only in assert builds. regards, tom lane
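A minimal sketch of that idea (an outline only, not the actual fix; AuxProcessMain is a hypothetical stand-in for the startup, checkpointer, or bgwriter main routines):

#include "postgres.h"
#include "utils/resowner.h"

/* Sketch: give an auxiliary process its own ResourceOwner so that any buffer
 * pins it takes are tracked, and release them wholesale on the way out (or in
 * an error-recovery path), instead of relying on the special case in
 * resowner.c that lets pins be taken with no owner at all. */
static void
AuxProcessMain(void)
{
    CurrentResourceOwner = ResourceOwnerCreate(NULL, "AuxiliaryProcess");

    /* ... the process's normal work: WAL replay, BufferSync(), etc. ... */

    /* Buffer pins are released in the BEFORE_LOCKS phase. */
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_LOCKS, false, true);
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_AFTER_LOCKS, false, true);
    ResourceOwnerDelete(CurrentResourceOwner);
    CurrentResourceOwner = NULL;
}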
Re: Libpq support to connect to standby server as priority
Haribabu Kommi wrote: > > On Wed, Jul 4, 2018 at 11:14 PM Laurenz Albe > > wrote: > > > Haribabu Kommi wrote: > > > > > > - I think the construction with "read_write_host_index" makes the code > > > even more > > > complicated than it already is. > > > > > > What about keeping the first successful connection open and storing it > > > in a > > > variable if we are in "prefer-read" mode. > > > If we get the read-only connection we desire, close that cached > > > connection, > > > otherwise use it. > > > > Even if we add a variable to cache the connection, I don't think the logic > > of checking > > the next host for the read-only host logic may not change, but the extra > > connection > > request to the read-write host again will be removed. > > I evaluated your suggestion of caching the connection and reuse it when there > is no > read only server doesn't find, but I am thinking that it will add more > complexity and also > the connection to the other servers delays, the cached connection may be > closed by > the server also because of timeout. > > I feel the extra time during connection may be fine, if user is preferring > the prefer-read > mode, instead of adding more complexity in handling the cached connection? > > comments? I tested the new patch, and it works as expected. I don't think that time-out of the cached session is a valid concern, because that would have to be a really short timeout. On the other hand, establishing the connection twice (first to check if it is read-only, then again because no read-only connection is found) can be quite costly. But that is a matter of debate, as is the readability of the code. Since I don't think I can contribute more to this patch, I'll mark it as ready for committer. Yours, Laurenz Albe
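For reference, the tradeoff being debated can be approximated from the client side with standard libpq calls; a rough sketch of the logic (error handling trimmed, and this illustrates the idea rather than the patch itself):

#include <string.h>
#include <libpq-fe.h>

/* Returns true if the server reports itself as read-only. */
static int
is_read_only(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "SHOW transaction_read_only");
    int         ro = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                      strcmp(PQgetvalue(res, 0, 0), "on") == 0);

    PQclear(res);
    return ro;
}

/* Try each host in turn, preferring a read-only server but caching the first
 * read-write connection as a fallback instead of reconnecting to it later. */
PGconn *
connect_prefer_read(const char *const *conninfos, int n)
{
    PGconn     *fallback = NULL;

    for (int i = 0; i < n; i++)
    {
        PGconn     *conn = PQconnectdb(conninfos[i]);

        if (PQstatus(conn) != CONNECTION_OK)
        {
            PQfinish(conn);
            continue;
        }
        if (is_read_only(conn))
        {
            if (fallback)
                PQfinish(fallback);     /* found what we prefer */
            return conn;
        }
        if (fallback == NULL)
            fallback = conn;            /* cache the read-write connection */
        else
            PQfinish(conn);
    }
    return fallback;                    /* may be NULL if nothing worked */
}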
Re: GiST VACUUM
On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas wrote: > I'm still a bit scared about using pd_prune_xid to store the XID that > prevents recycling the page too early. Can we use some field in > GISTPageOpaqueData for that, similar to how the B-tree stores it in > BTPageOpaqueData? What's your reason for being scared? It seems to me that the alternatives being proposed (altering the size of the special space, or sometimes repurposing a field for some other purpose) have their own associated scariness. If I had my druthers, I'd nuke pd_prune_xid from orbit -- it's a piece of seemingly heap-specific data that is kept in the generic page header rather than in the heap's special space. Other AMs like btree or zheap may have different needs; one size does not fit all. That said, since getting rid of pd_prune_xid seems impractical, maybe it makes more sense to reuse it than to insist on leaving it idle and consuming space elsewhere in the page. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Sun, Jul 15, 2018 at 1:43 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Partitions: test11 FOR VALUES FROM (0) TO (100), > test12 FOR VALUES FROM (100) TO (200), > test13 FOR VALUES FROM (200) TO (300) > > Partitions: test21 FOR VALUES FROM (10) TO (110), > test22 FOR VALUES FROM (110) TO (210), > test23 FOR VALUES FROM (210) TO (310) > > I'm confused, since there is only one-to-one mapping of partitions. No, test21 would have to be joined to both test11 and test12, since either could contain matching rows. Also, test22 would have to be joined to both test12 and test13. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Make foo=null a warning by default.
On Mon, Jul 16, 2018 at 3:49 AM, Heikki Linnakangas wrote: > I don't agree with changing the default to 'warn'. "foo = NULL" is perfectly > legal SQL, even if it's not very useful in practice. +1. I think that will probably generate lots of annoying warnings in server logs compared to the value it adds. I don't object to having an optional feature to generate warnings, but I think it should be something a particular user needs to activate, not something we enable by default. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Make foo=null a warning by default.
Heikki Linnakangas writes: > On 16/07/18 04:40, David Fetter wrote: >> Per a discussion with Andrew Gierth and Vik Fearing, both of whom >> helped make this happen, please find attached a patch which makes it >> possible to get SQL standard behavior for "= NULL", which is an error. >> It's been upgraded to a warning, and can still be downgraded to >> silence (off) and MS-SQL-compatible behavior (on). > I don't agree with changing the default to 'warn'. "foo = NULL" is > perfectly legal SQL, even if it's not very useful in practice. I think that there's a very narrow argument to be made that SQL doesn't allow a NULL literal with context-determined type in this context. But we decided to generalize that restriction long ago, and suddenly deciding to enforce it only in this one context makes no sense to me. The idea that we would ever decide that it's an error seems flat out ridiculous. TBH I'm not really excited about investing any work in this area at all. Considering how seldom we hear any questions about transform_null_equals anymore[1], I'm wondering if we couldn't just rip the "feature" out entirely. But to the extent that we want to leave it in place, it's 100% for backwards compatibility, and that is a strong argument against changing anything about it. I'm also pretty dubious that issuing a warning here will accomplish much to help anyone find bugs. There are too many small variants of the same problem that it will not catch[2]. regards, tom lane [1] In a quick search, the most recent discussion I could find was from 2011: https://www.postgresql.org/message-id/flat/201110070729.p977tdcx075...@wwwmaster.postgresql.org Before that it came up maybe once a year or so, but it's just fallen off a cliff since then. I wonder whether Microsoft fixed Access to not need it. [2] Compare for instance the discussion about bug #6064, https://www.postgresql.org/message-id/flat/4DFE481F02250003E8EA%40gw.wicourts.gov A very large fraction of the pre-2011 threads mentioning transform_null_equals are essentially of the form "why didn't/can't transform_null_equals fix this bad code for me?". Each of those cases would also be a case that the proposed warning doesn't catch.
Re: Make foo=null a warning by default.
On 16/07/18 18:10, Tom Lane wrote: Heikki Linnakangas writes: On 16/07/18 04:40, David Fetter wrote: Per a discussion with Andrew Gierth and Vik Fearing, both of whom helped make this happen, please find attached a patch which makes it possible to get SQL standard behavior for "= NULL", which is an error. It's been upgraded to a warning, and can still be downgraded to silence (off) and MS-SQL-compatible behavior (on). I don't agree with changing the default to 'warn'. "foo = NULL" is perfectly legal SQL, even if it's not very useful in practice. I think that there's a very narrow argument to be made that SQL doesn't allow a NULL literal with context-determined type in this context. But we decided to generalize that restriction long ago, and suddenly deciding to enforce it only in this one context makes no sense to me. The idea that we would ever decide that it's an error seems flat out ridiculous. TBH I'm not really excited about investing any work in this area at all. Considering how seldom we hear any questions about transform_null_equals anymore[1], I'm wondering if we couldn't just rip the "feature" out entirely. Yeah, I was wondering about that too. But Fabien brought up a completely new use-case for this: people learning SQL. For beginners who don't understand the behavior of NULLs yet, I can see a warning or error being useful training wheels. Perhaps a completely new "training_wheels=on" option, which could check may for many other beginner errors, too, would be better for that. - Heikki
Re: [HACKERS] logical decoding of two-phase transactions
Hi Nikhil, I've been looking at this patch series, and I do have a bunch of comments and questions, as usual ;-) Overall, I think it's clear the main risk associated with this patch is the decode group code - it touches PROC entries, so a bug may cause trouble pretty easily. So I've focused on this part, for now. 1) LogicalLockTransaction does roughly this ... if (MyProc->decodeGroupLeader == NULL) { leader = AssignDecodeGroupLeader(txn->xid); if (leader == NULL || !BecomeDecodeGroupMember((PGPROC *)leader, txn->xid)) goto lock_cleanup; } leader = BackendXidGetProc(txn->xid); if (!leader) goto lock_cleanup; leader_lwlock = LockHashPartitionLockByProc(leader); LWLockAcquire(leader_lwlock, LW_EXCLUSIVE); pgxact = &ProcGlobal->allPgXact[leader->pgprocno]; if(pgxact->xid != txn->xid) { LWLockRelease(leader_lwlock); goto lock_cleanup; } ... I wonder why we need the BackendXidGetProc call after the first if block. Can we simply grab MyProc->decodeGroupLeader at that point? 2) InitProcess now resets decodeAbortPending/decodeLocked flags, while checking decodeGroupLeader/decodeGroupMembers using asserts. Isn't that a bit strange? Shouldn't it do the same thing with both? 3) A comment in ProcKill says this: * Detach from any decode group of which we are a member. If the leader * exits before all other group members, its PGPROC will remain allocated * until the last group process exits; that process must return the * leader's PGPROC to the appropriate list. So I'm wondering what happens if the leader dies before other group members, but the PROC entry gets reused for a new connection. It clearly should not be a leader for that old decode group, but it may need to be a leader for another group. 4) strange hunk in ProcKill There seems to be some sort of merge/rebase issue, because this block of code (line ~880) related to lock groups /* Return PGPROC structure (and semaphore) to appropriate freelist */ proc->links.next = (SHM_QUEUE *) *procgloballist; *procgloballist = proc; got replaced by code relared to decode groups. That seems strange. 5) ReorderBufferCommitInternal I see the LogicalLockTransaction() calls in ReorderBufferCommitInternal have vastly variable comments. Some calls have no comment, some calls have "obvious" comment like "Lock transaction before catalog access" and one call has this very long comment /* * Output plugins can access catalog metadata and we * do not have any control over that. We could ask * them to call * LogicalLockTransaction/LogicalUnlockTransaction * APIs themselves, but that leads to unnecessary * complications and expectations from plugin * writers. We avoid this by calling these APIs * here, thereby ensuring that the in-progress * transaction will be around for the duration of * the apply_change call below */ I find that rather inconsistent, and I'd say those comments are useless. I suggest to remove all the per-call comments and instead add a comment about the locking into the initial file-level comment, which already explains handling of large transactions, etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Refactor documentation for wait events (Was: pgsql: Add wait event for fsync of WAL segments)
On Sun, Jul 15, 2018 at 10:10 PM, Michael Paquier wrote: > On Fri, Jul 13, 2018 at 04:57:59PM -0500, Robert Haas wrote: >> On Mon, Jul 9, 2018 at 4:41 PM, Michael Paquier wrote: >>> Another idea that I have here, is to rework the page for monitoring >>> stats so as we create one sub-section for each system view, and also one >>> for the table of wait events. For the wait events, we could then >>> completely remove the first category column which has morerows and >>> divide the section into on table per event category. >> >> +1 from me. I think I proposed that before. > > Attached is a proof of concept of that. I have divided the "Viewing > Statistics" section into a subset for each catalog, and each wait event > type gains its sub-section as well. There is a bit more to do with the > indentation and some xreflabels, but I think that this is enough to > begin a discussion. > > Thoughts? This doesn't seem to get rid of the morerows stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Logical decoding from promoted standby with same replication slot
On Fri, Jul 13, 2018 at 2:30 PM, Jeremy Finzel wrote: > Hello - > > We are working on several DR scenarios with logical decoding. Although we > are using pglogical the question we have I think is generally applicable to > logical replication. > > Say we have need to drop a logical replication slot for some emergency > reason on the master, but we don't want to lose the data permanently. We > can make a point-in-time-recovery snapshot of the master to use in order to > recover the lost data in the slot we are about to drop. Then we drop the > slot on master. > > We can then point our logical subscription to pull from the snapshot to > get the lost data, once we promote it. > > The question is that after promotion, logical decoding is looking for a > timeline 2 file whereas the file is still at timeline 1. > > The WAL file is 000108FD003C, for example. After promotion, > it is still 000108FD003C in pg_wal. But logical decoding says > ERROR: segment 000208FD003C has already been removed (it is > looking for a timeline 2 WAL file). Simply renaming the file actually > allows us to stream from the replication slot accurately and recover the > data. > > But all of this begs the question of an easier way to do this - why > doesn't logical decoding know to look for a timeline 1 file? It is really > helpful to have this ability to easily recover logical replicated data from > a snapshot of a replication slot, in case of disaster. > > All thoughts very welcome! > > Thanks, > Jeremy > I'd like to bump this question with some elaboration on my original question: is it possible to do a *controlled* failover reliably with logical decoding, assuming there are unconsumed changes in the replication slot that client still needs? It is rather easy to do a controlled failover if we can verify there are no unconsumed changes in the slot before failover. Then, we just recreate the slot on the promoted standby while clients are locked out, and we have not missed any data changes. I am trying to figure out if the problem of following timelines, as per this wiki for example: https://wiki.postgresql.org/wiki/Failover_slots, can be worked around in a controlled scenario. One additional part of this is that after failover I have 2 WAL files with the same walfile name but on differing timelines, and the promoted standby is only going to decode from the latter. Does that mean I am likely to lose data? Part of the reason I ask is because in testing, I have NOT lost data in doing a controlled failover as described above (i.e. with unconsumed changes in the slot that I need to replay on promoted standby). I am trying to figure out if I've gotten lucky or if this method is actually reliable. That is, renaming the WAL files to bump the timeline, since these WAL files are simply identical to the ones that were played on the master, and thus ought to show the same logical decoding information to be consumed. Thank you! Jeremy
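For context on why the expected file name changes at all: the segment name is just three 8-digit hex fields, with the timeline ID first, as in the sketch below, which mirrors the naming convention (the in-tree XLogFileName macro does the equivalent, though its exact arguments vary between versions). So a slot replaying on timeline 2 constructs a 0002... name even though identical bytes exist under the 0001... name.

#include <stdint.h>
#include <stdio.h>

/* Sketch of the WAL segment naming scheme: timeline ID, then the high and low
 * halves of the segment number, each as 8 hex digits.  With 16MB segments
 * there are 256 segments per "xlogid", hence the divisor. */
static void
wal_file_name(char *buf, size_t buflen, uint32_t timeline, uint64_t segno)
{
    const uint64_t segs_per_xlogid = UINT64_C(0x100000000) / (16 * 1024 * 1024);

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) timeline,
             (unsigned int) (segno / segs_per_xlogid),
             (unsigned int) (segno % segs_per_xlogid));
}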
Re: New GUC to sample log queries
On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing wrote: > Hmm. Not sure if that last word should be _sample, _sampling, _rate, or > a combination of those. +1 for rate or sample_rate. I think "sample" or "sampling" without "rate" will not be very clear. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PATCH: psql tab completion for SELECT
Hello, postgres=# select partkey, partc[TAB] > [no completions] > From the thread, I believe that this feature will be implemented in a later patch. > > And I'd love this case, where go back to edit the SELECT list, after > already typing the FROM part, to be smarter: > > postgres=# select p[TAB] FROM lp; > Display all 370 possibilities? (y or n) > I believe this would be a very interesting feature indeed. After playing around a little bit with the patch, I noticed that a comma was missing in line 1214 + 1202 /* min_server_version */ + 1203 9, + 1204 /* catname */ + 1205 "pg_catalog.pg_proc p", + 1206 /* selcondition */ + 1207 "p.prorettype NOT IN ('trigger'::regtype, 'internal'::regtype) " + 1208 "AND 'internal'::regtype != ALL (p.proargtypes) " + 1209 "AND p.oid NOT IN (SELECT unnest(array[typinput,typoutput,typreceive,typsend,typmodin,typmodout,typanalyze]) FROM pg_type) " + 1210 "AND p.oid NOT IN (SELECT unnest(array[aggtransfn,aggfinalfn]) FROM pg_aggregate) " + 1211 "AND p.oid NOT IN (SELECT unnest(array[oprcode,oprrest,oprjoin]) FROM pg_operator) " + 1212 "AND p.oid NOT IN (SELECT unnest(array[lanplcallfoid,laninline,lanvalidator]) FROM pg_language) " + 1213 "AND p.oid NOT IN (SELECT castfunc FROM pg_cast) " + 1214 "AND p.oid NOT IN (SELECT amproc FROM pg_amproc) " + 1215 /* viscondition */ + 1216 "pg_catalog.pg_function_is_visible(p.oid)", To catch these typos, it would be good if we could get some testing around psql. (Newbie question: do we have any kind of testing around tools like psql?) Thanks Joao
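As an aside on why this class of typo compiles cleanly, here is a small standalone C illustration (not the psql code itself): adjacent string literals are concatenated by the compiler, so the missing comma silently merges two entries and shifts every later initializer by one.

#include <stdio.h>

int
main(void)
{
    const char *parts[] = {
        "p.prorettype <> 0 ",                                 /* selcondition */
        "AND p.oid NOT IN (SELECT amproc FROM pg_amproc) "    /* missing comma here... */
        "pg_catalog.pg_function_is_visible(p.oid)",           /* ...so this is glued onto the previous literal */
    };

    /* Prints 2, not the 3 entries the author intended. */
    printf("%zu\n", sizeof(parts) / sizeof(parts[0]));
    return 0;
}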
Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
On 12/07/18 21:27, Fabien COELHO wrote: For the testing, we just need to make sure that at least one progress report is always printed, if -P is used. Right? Yep. That is the first condition above: the last_report is set to thread_start, meaning that there has been no report. So where does the 0.5 second rule come in? Can't we just do "if (no progress reports were printed) { print progress report; }" at the end? The second 0.5s condition is to print a closing report if some significant time passed since the last one, but we do not want to print a report at the same second. pgbench -T 5 -P 2 would then print reports at 2, 4 and 5. 0.5 ensures that we are not going to do "2 4.0[00] 4.0[01]" on -t whatever -P 2, which would look silly. If you do not like it then the second condition can be removed, fine with me. As the code stands, you would get reports at "2 4.0[00]", right? Let's keep it that way. I think the only change we need to make in the logic is to check at the end, if *any* progress reports at all have been printed, and print one if not. And do that only when the -P option is smaller than the -T option, I suppose. It also adds a small feature which is that there is always a final progress report when the run is completed, which can be useful when computing progress statistics, otherwise some transactions could not be reported in any progress at all. Any transactions in the last 0.5 seconds might still not get reported in any progress reports. Yep. I wanted a reasonable threshold, given that both -T and -P are in seconds anyway, so it probably could only happen with -t. Oh. I'm a bit surprised we don't support decimals, i.e. -P 0.5. Actually, it seems to be accepted, but it's truncated down to the nearest integer. That's not very nice :-(. But it's a separate issue. Indeed… but then throttling would not be tested:-) The point of the test is to exercise all time-related options, including throttling with a reasonably small value. Ok. I don't think that's really worthwhile. If we add some code that only runs in testing, then we're not really testing the real thing. I wouldn't trust the test to tell much. Let's just leave out that magic environment variable thing, and try to get the rest of the patch finished. If you remove the environment, then some checks need to be removed, because the 2 second run may be randomly shortened when there is nothing to do. If not, the test will fail nondeterministically, which is not acceptable. Hence the hack. I agree that it is not beautiful. The more reasonable alternative could be to always last 2 seconds under -T 2, even if the execution could be shortened because there is nothing to do at all, i.e. remove the environment-based condition but keep the sleep. That sounds reasonable. It's a bit silly to wait when there's nothing to do, but it's also weird if the test exits before the specified time is up. Seems less surprising to always sleep. - Heikki
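A minimal sketch of the end-of-run progress-report logic converged on above, using made-up names rather than the real pgbench variables:

#include <stdbool.h>
#include <stdio.h>

static bool printed_any_progress = false;

/* Stand-in for the real progress report printer. */
static void
print_progress_line(double now)
{
    printf("progress: %.1f s ...\n", now);
}

static void
maybe_print_progress(double now, double *last_report, int progress_interval)
{
    if (now - *last_report >= progress_interval)
    {
        print_progress_line(now);
        *last_report = now;
        printed_any_progress = true;
    }
}

static void
finish_run(double now, int progress_interval, int duration)
{
    /* Guarantee at least one report when -P is used and is smaller than -T. */
    if (progress_interval > 0 && progress_interval < duration &&
        !printed_any_progress)
        print_progress_line(now);
}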
Re: Vacuum: allow usage of more than 1GB of work mem
On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire wrote: > > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > wrote: > > > > > > > > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > > > On 13/07/18 01:39, Andrew Dunstan wrote: > > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > > >>> On 2018-Jul-12, Andrew Dunstan wrote: > > >>> > > I fully understand. I think this needs to go back to "Waiting on > > Author". > > >>> Why? Heikki's patch applies fine and passes the regression tests. > > >> > > >> Well, I understood Claudio was going to do some more work (see > > >> upthread). > > > > > > Claudio raised a good point, that doing small pallocs leads to > > > fragmentation, and in particular, it might mean that we can't give > > > back the memory to the OS. The default glibc malloc() implementation > > > has a threshold of 4 or 32 MB or something like that - allocations > > > larger than the threshold are mmap()'d, and can always be returned to > > > the OS. I think a simple solution to that is to allocate larger > > > chunks, something like 32-64 MB at a time, and carve out the > > > allocations for the nodes from those chunks. That's pretty > > > straightforward, because we don't need to worry about freeing the > > > nodes in retail. Keep track of the current half-filled chunk, and > > > allocate a new one when it fills up. > > > > > > Google seems to suggest the default threshold is much lower, like 128K. > > Still, making larger allocations seems sensible. Are you going to work > > on that? > > Below a few MB the threshold is dynamic, and if a block bigger than > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > the dynamic threshold is set to the size of the freed block. > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. > > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else.
Re: Make foo=null a warning by default.
Heikki Linnakangas writes: > On 16/07/18 18:10, Tom Lane wrote: >> TBH I'm not really excited about investing any work in this area at all. >> Considering how seldom we hear any questions about transform_null_equals >> anymore[1], I'm wondering if we couldn't just rip the "feature" out >> entirely. > Yeah, I was wondering about that too. But Fabien brought up a completely > new use-case for this: people learning SQL. For beginners who don't > understand the behavior of NULLs yet, I can see a warning or error being > useful training wheels. Perhaps a completely new "training_wheels=on" > option, which could check may for many other beginner errors, too, would > be better for that. Agreed --- but what we'd want that to do seems only vaguely related to the existing behavior of transform_null_equals. As an example, we intentionally made transform_null_equals *not* trigger on CASE x WHEN NULL THEN ... but a training-wheels warning for that would likely be reasonable. For that matter, many of the old threads about this are complaining about nulls that aren't simple literals in the first place. I wonder whether a training-wheels feature that whined *at runtime* about null WHERE-qual or case-test results would be more useful than a parser check. regards, tom lane
Re: Vacuum: allow usage of more than 1GB of work mem
On 16/07/18 18:35, Claudio Freire wrote: On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire wrote: On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up. Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else. Yeah, we basically want to be well above whatever the threshold is. I don't think we should try to check for any specific constant, just make it large enough. Different libc implementations might have different policies, too. There's little harm in overshooting, and making e.g. 64 MB allocations when 1 MB would've been enough to trigger the mmap() behavior. It's going to be more granular than the current situation, anyway, where we do a single massive allocation. (A code comment to briefly mention the thresholds on common platforms would be good, though). - Heikki
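To make the suggestion concrete, here is a minimal bump-allocator sketch of the idea (not the actual vacuum code): grab memory in chunks well above any malloc mmap threshold, carve node-sized pieces out of the current chunk, and free everything in one pass at the end.

#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE  (64 * 1024 * 1024)   /* comfortably above any mmap threshold */

typedef struct Chunk
{
    struct Chunk *next;
    size_t        used;
    char          data[];                /* CHUNK_SIZE bytes follow */
} Chunk;

static Chunk *current_chunk = NULL;

/* Carve a node out of the current chunk, starting a new chunk when it fills
 * up.  Individual nodes are never freed; see chunk_free_all(). */
static void *
chunk_alloc(size_t size)
{
    size = (size + 7) & ~(size_t) 7;     /* keep 8-byte alignment */

    if (current_chunk == NULL || current_chunk->used + size > CHUNK_SIZE)
    {
        Chunk *c = malloc(offsetof(Chunk, data) + CHUNK_SIZE);

        if (c == NULL)
            return NULL;
        c->next = current_chunk;
        c->used = 0;
        current_chunk = c;
    }

    void *result = current_chunk->data + current_chunk->used;

    current_chunk->used += size;
    return result;
}

/* Release every chunk at once; allocations this large are typically
 * mmap()'d by glibc, so the memory goes straight back to the OS. */
static void
chunk_free_all(void)
{
    while (current_chunk != NULL)
    {
        Chunk *next = current_chunk->next;

        free(current_chunk);
        current_chunk = next;
    }
}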
Re: Make foo=null a warning by default.
On Mon, Jul 16, 2018 at 09:49:14AM +0200, Fabien COELHO wrote: > Hello David, > > >Per a discussion with Andrew Gierth and Vik Fearing, both of whom > >helped make this happen, please find attached a patch which makes it > >possible to get SQL standard behavior for "= NULL", which is an error. > >It's been upgraded to a warning, and can still be downgraded to > >silence (off) and MS-SQL-compatible behavior (on). > > That's nice, and good for students, and probably others as well:-) > > A few comments: > > Patch applies & compiles, "make check" ok. > > #backslash_quote = safe_encoding# on, off, or safe_encoding > [...] > #transform_null_equals = warn Fixed. > I think it would be nice to have the possible values as a comment in > "postgresql.conf". > > * Code: > > -bool operator_precedence_warning = false; > +bool operator_precedence_warning = false; > > Is this reindentation useful/needed? I believe so because it fits with the line just below it. > + parser_errposition(pstate, a->location))); > + if (need_transform_null_equals && transform_null_equals == > TRANSFORM_NULL_EQUALS_ON) > > For consistency, maybe skip a line before the if? Fixed. > transform_null_equals_options[] = { [...] > > I do not really understand the sort order of this array. Maybe it could be > alphabetical, or per value? Or maybe it is standard values and then > synonyms, in which is case maybe a comment on the end of the list. I've changed it to per value. Is this better? > Guc help text: ISTM that it should spell out the possible values and their > behavior. Maybe something like: > > """ > When turned on, turns expr = NULL into expr IS NULL. > With off, warn or error, do not do so and be silent, issue a warning or an > error. > The standard behavior is not to perform this transformation. > Default is warn. > """ > > Or anything in better English. I've changed this, and hope it's clearer. > * Test > > +select 1=null; > + ?column? > > Maybe use as to describe the expected behavior, so that it is easier to > check, and I think that "?column?" looks silly eg: > > select 1=null as expect_{true,false}[_and_a_warning/error]; Changed to more descriptive headers. >create table cnullchild (check (f1 = 1 or f1 = null)) > inherits(cnullparent); > +WARNING: = NULL can only produce a NULL > > I'm not sure of this test case. Should it be turned into "is null"? This was just adjusting the existing test output to account for the new warning. I presume it was phrased that way for a reason. > Maybe the behavior could be retested after the reset? > > * Documentation: > > The "error" value is not documented. > > More generally, The value is said to be an enum, but the list of values is > not listed anywhere in the documentation. > > ISTM that the doc would be clearer if for each of the four values of the > parameter the behavior is spelled out. > > Maybe "warn" in the doc should be warn or something. Fixed. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From 33b9f51b1374ede71a6479d44adfa0588c7fb4c3 Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 15 Jul 2018 16:11:08 -0700 Subject: [PATCH] Make foo=null a warning by default v2 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="2.17.1" This is a multi-part message in MIME format. --2.17.1 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit The transform_null_equals GUC is now an enum consisting of warn, error, off, on. 
--- doc/src/sgml/config.sgml | 20 ++--- src/backend/parser/parse_expr.c | 30 ++--- src/backend/utils/misc/guc.c | 44 +-- src/backend/utils/misc/postgresql.conf.sample | 2 +- src/include/parser/parse_expr.h | 11 - src/test/regress/expected/guc.out | 30 + src/test/regress/expected/inherit.out | 1 + src/test/regress/sql/guc.sql | 18 8 files changed, 129 insertions(+), 27 deletions(-) --2.17.1 Content-Type: text/x-patch; name="0001-Make-foo-null-a-warning-by-default.patch" Content-Transfer-Encoding: 8bit Content-Disposition: attachment; filename="0001-Make-foo-null-a-warning-by-default.patch" diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e307bb4e8e..626a2d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -8036,7 +8036,7 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' - transform_null_equals (boolean) + transform_null_equals (enum) IS NULL transform_null_equals configuration parameter @@ -8044,15 +8044,23 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' -When on,
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Jul 16, 2018 at 11:21 AM, Tomas Vondra wrote: > Overall, I think it's clear the main risk associated with this patch is the > decode group code - it touches PROC entries, so a bug may cause trouble > pretty easily. So I've focused on this part, for now. I agree. As a general statement, I think the idea of trying to prevent transactions from aborting is really scary. It's almost an axiom of the system that we're always allowed to abort, and I think there could be a lot of unintended and difficult-to-fix consequences of undermining that guarantee. I think it will be very difficult to create a sound system for delaying transactions, and I doubt very much that the proposed system is sound. In particular: - The do_wait loop contains a CHECK_FOR_INTERRUPTS(). If that makes it interruptible, then it's possible for the abort to complete before the decoding processes have aborted. If that can happen, then this whole mechanism is completely pointless, because it fails to actually achieve the guarantee which is its central goal. On the other hand, if you don't make this abort interruptible, then you are significantly increase the risk that a backend could get stuck in the abort path for an unbounded period of time. If the aborting backend holds any significant resources at this point, such as heavyweight locks, then you risk creating a deadlock that cannot be broken until the decoding process manages to abort, and if that process is involved in the deadlock, then you risk creating an unbreakable deadlock. - BackendXidGetProc() seems to be called in multiple places without any lock held. I don't see how that can be safe, because AFAICS it must inevitably introduce a race condition: the answer can change after that value is returned but before it is used. There's a bunch of recheck logic that looks like it is trying to cope with this problem, but I'm not sure it's very solid. For example, AssignDecodeGroupLeader reads proc->decodeGroupLeader without holding any lock; we have historically avoided assuming that pointer-width reads cannot be torn. (We have assumed this only for 4-byte reads or narrower.) There are no comments about the locking hazards here, and no real explanation of how the recheck algorithm tries to patch things up: + leader = BackendXidGetProc(xid); + if (!leader || leader != proc) + { + LWLockRelease(leader_lwlock); + return NULL; + } Can be non-NULL yet unequal to proc? I don't understand how that can happen: surely once the PGPROC that has that XID aborts, the same XID can't possibly be assigned to a different PGPROC. - The code for releasing PGPROCs in ProcKill looks completely unsafe to me. With locking groups for parallel query, a process always enters a lock group of its own volition. It can safely use (MyProc->lockGroupLeader != NULL) as a race-free test because no other process can modify that value. But in this implementation of decoding groups, one process can put another process into a decoding group, which means this test has a race condition. If there's some reason this is safe, the comments sure don't explain it. I don't want to overplay my hand, but I think this code is a very long way from being committable, and I am concerned that the fundamental approach of blocking transaction aborts may be unsalvageably broken or at least exceedingly dangerous. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Another usability issue with our TAP tests
Since "make check-world" is rather chatty, if you get a failure while running it under high parallelism, the location of the failure has often scrolled off the terminal window by the time all the other subjobs exit. This is not a huge problem for tests using our traditional infrastructure, because you can just run "git status" to look for regression.diffs files. But a TAP test failure leaves nothing behind that git will consider unusual. I've repeatedly had to run check-world with no parallelism (wasting many minutes) in order to locate which test actually failed. I'm not sure about a good way to improve this. One idea that comes to mind is to tweak the "make check" rules so that the tmp_check subdirectories are automatically deleted on successful completion, but not on failure, and then remove tmp_check from the .gitignore lists. But the trouble with that is sometimes you want to look at the test logs afterwards, even when make thought the test succeeded. Ideas? regards, tom lane
Re: [HACKERS] logical decoding of two-phase transactions
Robert Haas writes: > I agree. As a general statement, I think the idea of trying to > prevent transactions from aborting is really scary. It's almost an > axiom of the system that we're always allowed to abort, and I think > there could be a lot of unintended and difficult-to-fix consequences > of undermining that guarantee. I think it will be very difficult to > create a sound system for delaying transactions, and I doubt very much > that the proposed system is sound. Ugh, is this patch really dependent on such a thing? TBH, I think the odds of making that work are indistinguishable from zero; and even if you managed to commit something that did work at the instant you committed it, the odds that it would stay working in the face of later system changes are exactly zero. I would reject this idea out of hand. regards, tom lane
Re: GiST VACUUM
> On 16 July 2018, at 18:58, Robert Haas wrote: > > On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas wrote: >> I'm still a bit scared about using pd_prune_xid to store the XID that >> prevents recycling the page too early. Can we use some field in >> GISTPageOpaqueData for that, similar to how the B-tree stores it in >> BTPageOpaqueData? > > What's your reason for being scared? It seems to me that the > alternatives being proposed (altering the size of the special space, > or sometimes repurposing a field for some other purpose) have their > own associated scariness. Thanks, that's exactly the question I've been thinking about: where to store this xid. Here's v9 of the patch; it uses pd_prune_xid, but it is abstracted behind GistPageGetDeleteXid() / GistPageSetDeleteXid() so that we can change the way we store it easily. Best regards, Andrey Borodin. 0002-Physical-GiST-scan-during-VACUUM-v9.patch Description: Binary data 0001-Delete-pages-during-GiST-VACUUM-v9.patch Description: Binary data
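A guess at the shape such an abstraction could take (the actual patch may well define these differently): a pair of thin macros over the page header field.

#include "postgres.h"
#include "storage/bufpage.h"

/* Hypothetical sketch of the accessors: wrappers over the page header's
 * pd_prune_xid field, so callers never touch the field directly and the
 * storage location can be moved later without changing them. */
#define GistPageGetDeleteXid(page) \
    (((PageHeader) (page))->pd_prune_xid)
#define GistPageSetDeleteXid(page, xid) \
    (((PageHeader) (page))->pd_prune_xid = (xid))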
Re: [HACKERS] logical decoding of two-phase transactions
On 07/16/2018 06:15 PM, Robert Haas wrote: On Mon, Jul 16, 2018 at 11:21 AM, Tomas Vondra wrote: Overall, I think it's clear the main risk associated with this patch is the decode group code - it touches PROC entries, so a bug may cause trouble pretty easily. So I've focused on this part, for now. I agree. As a general statement, I think the idea of trying to prevent transactions from aborting is really scary. It's almost an axiom of the system that we're always allowed to abort, and I think there could be a lot of unintended and difficult-to-fix consequences of undermining that guarantee. I think it will be very difficult to create a sound system for delaying transactions, and I doubt very much that the proposed system is sound. In particular: - The do_wait loop contains a CHECK_FOR_INTERRUPTS(). If that makes it interruptible, then it's possible for the abort to complete before the decoding processes have aborted. If that can happen, then this whole mechanism is completely pointless, because it fails to actually achieve the guarantee which is its central goal. On the other hand, if you don't make this abort interruptible, then you are significantly increase the risk that a backend could get stuck in the abort path for an unbounded period of time. If the aborting backend holds any significant resources at this point, such as heavyweight locks, then you risk creating a deadlock that cannot be broken until the decoding process manages to abort, and if that process is involved in the deadlock, then you risk creating an unbreakable deadlock. I'm not sure I understand. Are you suggesting the process might get killed or something, thanks to the CHECK_FOR_INTERRUPTS() call? - BackendXidGetProc() seems to be called in multiple places without any lock held. I don't see how that can be safe, because AFAICS it must inevitably introduce a race condition: the answer can change after that value is returned but before it is used. There's a bunch of recheck logic that looks like it is trying to cope with this problem, but I'm not sure it's very solid. But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's true there are a few places where we do != NULL checks on the result without holding any lock, but I don't see why that would be a problem? And before actually inspecting the contents, the code always does LockHashPartitionLockByProc. But I certainly agree this would deserve comments explaining why this (lack of) locking is safe. (The goal why it's done this way is clearly an attempt to acquire the lock as infrequently as possible, in an effort to minimize the overhead.) For example, AssignDecodeGroupLeader reads proc->decodeGroupLeader without holding any lock; we have historically avoided assuming that pointer-width reads cannot be torn. (We have assumed this only for 4-byte reads or narrower.) There are no comments about the locking hazards here, and no real explanation of how the recheck algorithm tries to patch things up: + leader = BackendXidGetProc(xid); + if (!leader || leader != proc) + { + LWLockRelease(leader_lwlock); + return NULL; + } Can be non-NULL yet unequal to proc? I don't understand how that can happen: surely once the PGPROC that has that XID aborts, the same XID can't possibly be assigned to a different PGPROC. Yeah. I have the same question. - The code for releasing PGPROCs in ProcKill looks completely unsafe to me. With locking groups for parallel query, a process always enters a lock group of its own volition. 
It can safely use (MyProc->lockGroupLeader != NULL) as a race-free test because no other process can modify that value. But in this implementation of decoding groups, one process can put another process into a decoding group, which means this test has a race condition. If there's some reason this is safe, the comments sure don't explain it. I don't follow. How could one process put another process into a decoding group? I don't think that's possible. I don't want to overplay my hand, but I think this code is a very long way from being committable, and I am concerned that the fundamental approach of blocking transaction aborts may be unsalvageably broken or at least exceedingly dangerous. I'm not sure about the 'unsalvageable' part, but it needs more work, that's for sure. Unfortunately, all previous attempts to make this work in various other ways failed (see past discussions in this thread), so this is the only approach left :-( So let's see if we can make it work. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 07/16/2018 07:21 PM, Tom Lane wrote: Robert Haas writes: I agree. As a general statement, I think the idea of trying to prevent transactions from aborting is really scary. It's almost an axiom of the system that we're always allowed to abort, and I think there could be a lot of unintended and difficult-to-fix consequences of undermining that guarantee. I think it will be very difficult to create a sound system for delaying transactions, and I doubt very much that the proposed system is sound. Ugh, is this patch really dependent on such a thing? Unfortunately it does :-( Without it the decoding (or output plugins) may see catalogs broken in various ways - the catalog records may get vacuumed, HOT chains are broken, ... There were attempts to change that part, but that seems an order of magnitude more invasive than this. TBH, I think the odds of making that work are indistinguishable from zero; and even if you managed to commit something that did work at the instant you committed it, the odds that it would stay working in the face of later system changes are exactly zero. I would reject this idea out of hand. Why? How is this significantly different from other patches touching ProcArray and related bits? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra wrote: > I'm not sure I understand. Are you suggesting the process might get killed > or something, thanks to the CHECK_FOR_INTERRUPTS() call? Yes. CHECK_FOR_INTERRUPTS() can certainly lead to a non-local transfer of control. > But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's > true there are a few places where we do != NULL checks on the result without > holding any lock, but I don't see why that would be a problem? And before > actually inspecting the contents, the code always does > LockHashPartitionLockByProc. I think at least some of those cases are a problem. See below... > I don't follow. How could one process put another process into a decoding > group? I don't think that's possible. Isn't that exactly what AssignDecodeGroupLeader() is doing? It looks up the process that currently has that XID, then turns that process into a decode group leader. Then after that function returns, the caller adds itself to the decode group as well. So it seems entirely possible for somebody to swing the decodeGroupLeader pointer for a PGPROC from NULL to some other value at an arbitrary point in time. > I'm not sure about the 'unsalvageable' part, but it needs more work, that's > for sure. Unfortunately, all previous attempts to make this work in various > other ways failed (see past discussions in this thread), so this is the only > approach left :-( So let's see if we can make it work. I think that's probably not going to work out, but of course it's up to you how you want to spend your time! After thinking about it a bit more, if you want to try to stick with this design, I don't think that this decode group leader/members thing has much to recommend it. In the case of parallel query, the point of the lock group stuff is to treat all of those processes as one for purposes of heavyweight lock acquisition. There's no similar need here, so the design that makes sure the "leader" is in the list of processes that are members of the "group" is, AFAICS, just wasted code. All you really need is a list of processes hung off of the PGPROC that must abort before the leader is allowed to abort; the leader itself doesn't need to be in the list, and there's no need to consider it as a "group". It's just a list of waiters. That having been said, I still don't see how that's really going to work. Just to take one example, suppose that the leader is trying to ERROR out, and the decoding workers are blocked waiting for a lock held by the leader. The system has no way of detecting this deadlock and resolving it automatically, which certainly seems unacceptable. The only way that's going to work is if the leader waits for the worker by trying to acquire a lock held by the worker. Then the deadlock detector would know to abort some transaction. But that doesn't really work either - the deadlock was created by the foreground process trying to abort, and if the deadlock detector chooses that process as its victim, what then? We're already trying to abort, and the abort code isn't supposed to throw further errors, or fail in any way, lest we break all kinds of other things. Not to mention the fact that running the deadlock detector in the abort path isn't really safe to begin with, again because we can't throw errors when we're already in an abort path. If we're only ever talking about decoding prepared transactions, we could probably work around all of these problems: have the decoding process take a heavyweight lock before it begins decoding. 
Have a process that wants to execute ROLLBACK PREPARED take a conflicting heavyweight lock on the same object. The net effect would be that ROLLBACK PREPARED would simply wait for decoding to finish. That might be rather lousy from a latency point of view since the transaction could take an arbitrarily long time to decode, but it seems safe enough. Possibly you could also design a mechanism for the ROLLBACK PREPARED command to SIGTERM the processes that are blocking its lock acquisition, if they are decoding processes. The difference between this and what the current patch is doing is that nothing complex or fragile is happening in the abort pathway itself. The complicated stuff in both the worker and in the main backend happens while the transaction is still good and can still be rolled back at need. This kind of approach won't work if you want to decode transactions that aren't yet prepared, so if that is the long term goal then we need to think harder. I'm honestly not sure that problem has any reasonable solution. The assumption that a running process can abort at any time is deeply baked into many parts of the system and for good reasons. Trying to undo that is going to be like trying to push water up a hill. I think we need to install interlocks in such a way that any waiting happens before we enter the abort path, not while we're actually trying to perform the abort. But I don't
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/16/2018 10:34 AM, Claudio Freire wrote: On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: On 13/07/18 01:39, Andrew Dunstan wrote: On 07/12/2018 06:34 PM, Alvaro Herrera wrote: On 2018-Jul-12, Andrew Dunstan wrote: I fully understand. I think this needs to go back to "Waiting on Author". Why? Heikki's patch applies fine and passes the regression tests. Well, I understood Claudio was going to do some more work (see upthread). Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up. Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html That page says: M_MMAP_MAX This parameter specifies the maximum number of allocation requests that may be simultaneously serviced using mmap(2). This parameter exists because some systems have a limited number of internal tables for use by mmap(2), and using more than a few of them may degrade performance. The default value is 65,536, a value which has no special significance and which serves only as a safeguard. Setting this parameter to 0 disables the use of mmap(2) for servicing large allocation requests. I'm confused about the relevance. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On Mon, Jul 16, 2018 at 3:30 PM Andrew Dunstan wrote: > > > > On 07/16/2018 10:34 AM, Claudio Freire wrote: > > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > > wrote: > >> > >> > >> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > >>> On 13/07/18 01:39, Andrew Dunstan wrote: > On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > > On 2018-Jul-12, Andrew Dunstan wrote: > > > >> I fully understand. I think this needs to go back to "Waiting on > >> Author". > > Why? Heikki's patch applies fine and passes the regression tests. > Well, I understood Claudio was going to do some more work (see > upthread). > >>> Claudio raised a good point, that doing small pallocs leads to > >>> fragmentation, and in particular, it might mean that we can't give > >>> back the memory to the OS. The default glibc malloc() implementation > >>> has a threshold of 4 or 32 MB or something like that - allocations > >>> larger than the threshold are mmap()'d, and can always be returned to > >>> the OS. I think a simple solution to that is to allocate larger > >>> chunks, something like 32-64 MB at a time, and carve out the > >>> allocations for the nodes from those chunks. That's pretty > >>> straightforward, because we don't need to worry about freeing the > >>> nodes in retail. Keep track of the current half-filled chunk, and > >>> allocate a new one when it fills up. > >> > >> Google seems to suggest the default threshold is much lower, like 128K. > >> Still, making larger allocations seems sensible. Are you going to work > >> on that? > > Below a few MB the threshold is dynamic, and if a block bigger than > > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > > the dynamic threshold is set to the size of the freed block. > > > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. > > > > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html > > > That page says: > > M_MMAP_MAX >This parameter specifies the maximum number of allocation >requests that may be simultaneously serviced using mmap(2). >This parameter exists because some systems have a limited >number of internal tables for use by mmap(2), and using more >than a few of them may degrade performance. > >The default value is 65,536, a value which has no special >significance and which serves only as a safeguard. Setting >this parameter to 0 disables the use of mmap(2) for servicing >large allocation requests. > > > I'm confused about the relevance. It isn't relevant. See my next message, it should have read DEFAULT_MMAP_THRESHOLD_MAX.
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/16/2018 11:35 AM, Claudio Freire wrote: On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire wrote: On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: On 13/07/18 01:39, Andrew Dunstan wrote: On 07/12/2018 06:34 PM, Alvaro Herrera wrote: On 2018-Jul-12, Andrew Dunstan wrote: I fully understand. I think this needs to go back to "Waiting on Author". Why? Heikki's patch applies fine and passes the regression tests. Well, I understood Claudio was going to do some more work (see upthread). Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up. Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else. Ah, ok. Thanks. ignore the email I just sent about that. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
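To make the chunking idea from this thread concrete, here is a minimal standalone sketch of "carve node allocations out of large blocks": grab memory in blocks well above the mmap threshold so glibc can hand them back to the OS, never free individual nodes, and release everything at once. The sizes and names are made up for illustration and are not taken from any posted patch.

/*
 * Sketch of a bump allocator: allocate 32 MB chunks (large enough to be
 * mmap()'d by glibc) and carve node-sized pieces out of the current chunk.
 * Individual nodes are never freed; chunk_reset() releases everything.
 */
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE (32UL * 1024 * 1024)	/* illustrative; above the mmap threshold */

typedef struct Chunk
{
	struct Chunk *next;		/* linked list of all chunks, for bulk free */
	size_t		used;		/* bytes already carved out of data[] */
	char		data[];		/* the allocation arena */
} Chunk;

static Chunk *current_chunk = NULL;

/* Hand out one node of 'size' bytes; assumes size is much smaller than CHUNK_SIZE. */
static void *
chunk_alloc(size_t size)
{
	void	   *p;

	size = (size + 7) & ~(size_t) 7;	/* keep 8-byte alignment */

	if (current_chunk == NULL || current_chunk->used + size > CHUNK_SIZE)
	{
		Chunk	   *c = malloc(offsetof(Chunk, data) + CHUNK_SIZE);

		if (c == NULL)
			return NULL;
		c->next = current_chunk;
		c->used = 0;
		current_chunk = c;
	}

	p = current_chunk->data + current_chunk->used;
	current_chunk->used += size;
	return p;
}

/* Free every chunk at once, returning the memory to the OS. */
static void
chunk_reset(void)
{
	while (current_chunk)
	{
		Chunk	   *next = current_chunk->next;

		free(current_chunk);
		current_chunk = next;
	}
}

In the vacuum case the same pattern would sit behind palloc-style wrappers inside a memory context, but the behaviour with respect to glibc's mmap threshold is the same.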
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas wrote: > Doesn't have to be a trigger, could be a CHECK constraint, datatype input > function, etc. Admittedly, having a datatype input function that inserts to > the table is worth a "huh?", but I'm feeling very confident that we can > catch all such cases, and some of them might even be sensible. Is this sentence missing a "not"? i.e. "I'm not feeling very confident"? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WAL logging problem in 9.4.3?
On 16 July 2018 21:38:39 EEST, Robert Haas wrote: >On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas >wrote: >> Doesn't have to be a trigger, could be a CHECK constraint, datatype >input >> function, etc. Admittedly, having a datatype input function that >inserts to >> the table is worth a "huh?", but I'm feeling very confident that we >can >> catch all such cases, and some of them might even be sensible. > >Is this sentence missing a "not"? i.e. "I'm not feeling very >confident"? Yes, sorry. - Heikki
Alter index rename concurrently to
Dear hackers! I have an idea to facilitate work with index rebuilding. Usually if we want to rebuild an index without table locks we should do the queries below: 1. create index concurrently... (with different name on the same columns) 2. drop index concurrently 3. alter index rename to As you can see, only the last query in this simple algorithm locks table. Commonly this steps are included to a more complex script and run by cron or manually. When you have high load databases, maybe some problems appear: 1. it’s dangerous and unacceptable for production to wait unknown number of minutes or even hours while a table is locked 2. we must write a complex script that sets statement timeout to rename and implement several checks there to prevent stopping of production at nights 3. it’s uncomfortable to find indexes that couldn’t be renamed during script executing Then we must execute renaming manually and interrupt it again and again if it can’t execute in allowable time. I made a patch to solve this issue (see the attachment). It allows to avoid locks by a query like this: “alter index rename CONCURRENTLY to ”. Briefly, I did it by using similar way in the index_create() and index_drop() functions that set ShareUpdateExclusiveLock instead of AccessShareLock when the structure DropStmt/CreateStmt has “true” in the stmt->concurrent field. (In other words, when it "see" "concurretly" in query). In all other cases (alter table, sequence, etc) I initialized this field as “false” respectively. I ran it when another transactions to the same table are started but not finished. Also I used a script that made SELECT\DML queries in a loop to that test tables and "ALTER INDEX ... RENAME CONCURRENTLY TO ..." works without any errors and indexes were valid after renaming. Maybe it’s interesting for someone. I’ll be very thankful for any ideas! -- Kind regards, Andrew Klychkov# Author: Andrew Klychkov # 16-07-2018 diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index 482d463..47e20d5 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -1658,14 +1658,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u", OIDOldHeap); RenameRelationInternal(newrel->rd_rel->reltoastrelid, - NewToastName, true); + NewToastName, true, false); /* ... and its valid index too. */ snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", OIDOldHeap); RenameRelationInternal(toastidx, - NewToastName, true); + NewToastName, true, false); } relation_close(newrel, NoLock); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 22e81e7..ead4f26 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -3012,7 +3012,7 @@ rename_constraint_internal(Oid myrelid, || con->contype == CONSTRAINT_UNIQUE || con->contype == CONSTRAINT_EXCLUSION)) /* rename the index; this renames the constraint as well */ - RenameRelationInternal(con->conindid, newconname, false); + RenameRelationInternal(con->conindid, newconname, false, false); else RenameConstraintById(constraintOid, newconname); @@ -3082,7 +3082,9 @@ RenameRelation(RenameStmt *stmt) { Oid relid; ObjectAddress address; +LOCKMODElockmode; +lockmode = stmt->concurrent ? 
ShareUpdateExclusiveLock : AccessExclusiveLock; /* * Grab an exclusive lock on the target table, index, sequence, view, * materialized view, or foreign table, which we will NOT release until @@ -3091,7 +3093,7 @@ RenameRelation(RenameStmt *stmt) * Lock level used here should match RenameRelationInternal, to avoid lock * escalation. */ - relid = RangeVarGetRelidExtended(stmt->relation, AccessExclusiveLock, + relid = RangeVarGetRelidExtended(stmt->relation, lockmode, stmt->missing_ok ? RVR_MISSING_OK : 0, RangeVarCallbackForAlterRelation, (void *) stmt); @@ -3105,7 +3107,7 @@ RenameRelation(RenameStmt *stmt) } /* Do the work */ - RenameRelationInternal(relid, stmt->newname, false); + RenameRelationInternal(relid, stmt->newname, false, stmt->concurrent); ObjectAddressSet(address, RelationRelationId, relid); @@ -3122,20 +3124,22 @@ RenameRelation(RenameStmt *stmt) * sequence, AFAIK there's no need for it to be there. */ void -RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal) +RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal, bool concurrent) { Relation targetrelation; Relation relrelation; /* for RELATION relation */ HeapTuple reltup; Form_pg_class relform; Oid namespaceId; +LOCKMODElockmode; /* * Grab an exclusive lock on the target table, index, sequence, view, * materialized view, or foreign table, which we will NOT release until * end of transaction. */ - targetrelation = re
Re: Alter index rename concurrently to
Hi! > 16 Jul 2018, at 22:58, Andrey Klychkov wrote: > Dear hackers! > > I have an idea to facilitate work with index rebuilding. > > "ALTER INDEX ... RENAME CONCURRENTLY TO ..." The idea seems useful. I'm not an expert in CIC, but your post does not cover one very important topic. When we use CREATE INDEX CONCURRENTLY we pay for the less intrusive lock by scanning the data twice. What is the price of RENAME CONCURRENTLY? Should this mode be the default? Best regards, Andrey Borodin.
Re: [HACKERS] logical decoding of two-phase transactions
On 07/16/2018 08:09 PM, Robert Haas wrote: > On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra > wrote: >> I'm not sure I understand. Are you suggesting the process might get killed >> or something, thanks to the CHECK_FOR_INTERRUPTS() call? > > Yes. CHECK_FOR_INTERRUPTS() can certainly lead to a non-local > transfer of control. > >> But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's >> true there are a few places where we do != NULL checks on the result without >> holding any lock, but I don't see why that would be a problem? And before >> actually inspecting the contents, the code always does >> LockHashPartitionLockByProc. > > I think at least some of those cases are a problem. See below... > >> I don't follow. How could one process put another process into a decoding >> group? I don't think that's possible. > > Isn't that exactly what AssignDecodeGroupLeader() is doing? It looks > up the process that currently has that XID, then turns that process > into a decode group leader. Then after that function returns, the > caller adds itself to the decode group as well. So it seems entirely > possible for somebody to swing the decodeGroupLeader pointer for a > PGPROC from NULL to some other value at an arbitrary point in time. > Oh, right, I forgot the patch also adds the leader into the group, for some reason (I agree it's unclear why that would be necessary, as you pointed out later). But all this is happening while holding the partition lock (in exclusive mode). And the decoding backends do synchronize on the lock correctly (although, man, the rechecks are confusing ...). But now I see ProcKill accesses decodeGroupLeader in multiple places, and only the first one is protected by the lock, for some reason (interestingly enough the one in lockGroupLeader block). Is that what you mean? FWIW I suspect the ProcKill part is borked due to incorrectly resolved merge conflict or something, per my initial response from today. >> I'm not sure about the 'unsalvageable' part, but it needs more work, that's >> for sure. Unfortunately, all previous attempts to make this work in various >> other ways failed (see past discussions in this thread), so this is the only >> approach left :-( So let's see if we can make it work. > > I think that's probably not going to work out, but of course it's up > to you how you want to spend your time! > Well, yeah. I'm sure I could think of more fun things to do, but OTOH I also have patches that require the capability to decode in-progress transactions. > After thinking about it a bit more, if you want to try to stick with > this design, I don't think that this decode group leader/members thing > has much to recommend it. In the case of parallel query, the point of > the lock group stuff is to treat all of those processes as one for > purposes of heavyweight lock acquisition. There's no similar need > here, so the design that makes sure the "leader" is in the list of > processes that are members of the "group" is, AFAICS, just wasted > code. All you really need is a list of processes hung off of the > PGPROC that must abort before the leader is allowed to abort; the > leader itself doesn't need to be in the list, and there's no need to > consider it as a "group". It's just a list of waiters. > But the way I understand it, it pretty much *is* a list of waiters, along with a couple of flags to allow the processes to notify the other side about lock/unlock/abort. It does resemble the lock groups, but I don't think it has the same goals. 
The thing is that the lock/unlock happens for each decoded change independently, and it'd be silly to modify the list all the time, so instead it just sets the decodeLocked flag to true/false. Similarly, when the leader decides to abort, it marks decodeAbortPending and waits for the decoding backends to complete. Of course, that's my understanding/interpretation, and perhaps Nikhil as a patch author has a better explanation. > That having been said, I still don't see how that's really going to > work. Just to take one example, suppose that the leader is trying to > ERROR out, and the decoding workers are blocked waiting for a lock > held by the leader. The system has no way of detecting this deadlock > and resolving it automatically, which certainly seems unacceptable. > The only way that's going to work is if the leader waits for the > worker by trying to acquire a lock held by the worker. Then the > deadlock detector would know to abort some transaction. But that > doesn't really work either - the deadlock was created by the > foreground process trying to abort, and if the deadlock detector > chooses that process as its victim, what then? We're already trying > to abort, and the abort code isn't supposed to throw further errors, > or fail in any way, lest we break all kinds of other things. Not to > mention the fact that running the deadlock detector in the abort path > isn't really safe to begin
Re: Alter index rename concurrently to
On Mon, 16 Jul 2018 at 21:58, Andrey Klychkov wrote: > I made a patch to solve this issue (see the attachment). > It allows to avoid locks by a query like this: > “alter index rename CONCURRENTLY to ”. > Please, have a look at previous discussions on the subject: - 2012 https://www.postgresql.org/message-id/cab7npqtys6juqdxuczbjb0bnw0kprw8wdzuk11kaxqq6o98...@mail.gmail.com - 2013 https://www.postgresql.org/message-id/cab7npqstfkuc7dzxcdx4hotu63yxhrronv2aouzyd-zz_8z...@mail.gmail.com - https://commitfest.postgresql.org/16/1276/ -- Victor Yegorov
Re: Fix some error handling for read() and errno
On 2018-Jul-16, Michael Paquier wrote: > On Sat, Jul 14, 2018 at 03:37:56PM +0900, Michael Paquier wrote: > > For now, I think that just moving forward with 0001, and then revisit > > 0002 once the other 2PC patch is settled makes the most sense. On the > > other thread, the current 2PC behavior can create silent data loss so > > I would like to back-patch it, so that would be less work. > > Are there any objections with this plan? If none, then I would like to > move on with 0001 as there is clearly a consensus to simplify the work > of translators and to clean up the error code paths for read() calls. > Let's sort out the rest after the 2PC code paths are addressed. No objection here -- incremental progress is better than none. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2018-Jul-12, Heikki Linnakangas wrote: > > > Thanks for the pointer. My tap test has been covering two out of > > > the three scenarios you have in your script. I have been able to > > > convert the extra as the attached, and I have added as well an > > > extra test with TRUNCATE triggers. So it seems to me that we want > > > to disable the optimization if any type of trigger are defined on > > > the relation copied to as it could be possible that these triggers > > > work on the blocks copied as well, for any BEFORE/AFTER and > > > STATEMENT/ROW triggers. What do you think? > > > > Yeah, this seems like the only sane approach. > > Doesn't have to be a trigger, could be a CHECK constraint, datatype > input function, etc. Admittedly, having a datatype input function that > inserts to the table is worth a "huh?", but I'm feeling very confident > that we can catch all such cases, and some of them might even be > sensible. A counterexample could be a JSON compression scheme that uses a catalog for a dictionary of keys. Hasn't this been described already? Also not completely out of the question for GIS data, I think. (Not sure if PostGIS does this already.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: New GUC to sample log queries
On 07/16/2018 05:24 PM, Robert Haas wrote: > On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing > wrote: >> Hmm. Not sure if that last word should be _sample, _sampling, _rate, or >> a combination of those. > > +1 for rate or sample_rate. I think "sample" or "sampling" without > "rate" will not be very clear. > Thanks, After reading this mail: https://www.postgresql.org/message-id/20180710183828.GB3890%40telsasoft.com I wonder if this GUC could be used for these logging methods? log_statement_stats log_parser_stats log_planner_stats log_executor_stats -- Adrien
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
I wrote: > So I said I didn't want to do extra work on this, but I am looking into > fixing it by having these aux process types run a ResourceOwner that can > be told to clean up any open buffer pins at exit. We could be sure the > coverage is complete by dint of removing the special-case code in > resowner.c that allows buffer pins to be taken with no active resowner. > Then CheckForBufferLeaks can be left as-is, ie something we do only > in assert builds. That turned out to be a larger can of worms than I'd anticipated, as it soon emerged that we'd acquired a whole lot of cargo-cult programming around ResourceOwners. Having a ResourceOwner in itself does nothing for you: there has to be code someplace that ensures we'll call ResourceOwnerRelease at an appropriate time. There was basically noplace outside xact.c and portalmem.c that got this completely right. bgwriter.c and a couple of other places at least tried a little, by doing a ResourceOwnerRelease in their sigsetjmp-catching stanzas, but that didn't account for the process-exit code path. Other places just created a ResourceOwner and did *nothing* about cleaning it up. I decided that the most expedient way of dealing with this was to create a self-contained facility in resowner.c that would create a standalone ResourceOwner and register a shmem-exit callback to clean it up. That way it's easier (less code) to do it right than to do it wrong. That led to the attached, which passes check-world as well as Michael's full-disk test script. I'm mostly pretty happy with this, but I think there are a couple of loose ends in logicalfuncs.c and slotfuncs.c: those are creating non-standalone ResourceOwners (children of whatever the active ResourceOwner is) and doing nothing much to clean them up. That seems pretty bogus. It's not a permanent resource leak, because sooner or later the parent ResourceOwner will get cleaned up and that will recurse to the child ... but then why bother with a child ResourceOwner at all? I added asserts to these calls to verify that there was a parent resowner (if there isn't, the code is just broken), but I think that we should either add more code to clean up the child resowner promptly, or just not bother with a child at all. Not very sure what to do with this. I don't think we should try to backpatch it, because it's essentially changing the API requirements around buffer access in auxiliary processes --- but it might not be too late to squeeze it into v11. It is a bug fix, in that the current code can leak buffer pins in some code paths and we won't notice in non-assert builds; but we've not really seen any field reports of that happening, so I'm not sure how important this is. regards, tom lane diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index 4e32cff..3a5cd0e 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -1221,8 +1221,7 @@ ParallelWorkerMain(Datum main_arg) memcpy(&ParallelWorkerNumber, MyBgworkerEntry->bgw_extra, sizeof(int)); /* Set up a memory context and resource owner. 
*/ - Assert(CurrentResourceOwner == NULL); - CurrentResourceOwner = ResourceOwnerCreate(NULL, "parallel toplevel"); + CreateAuxProcessResourceOwner(); CurrentMemoryContext = AllocSetContextCreate(TopMemoryContext, "Parallel worker", ALLOCSET_DEFAULT_SIZES); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4049deb..5d01edd 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6356,6 +6356,15 @@ StartupXLOG(void) struct stat st; /* + * We should have an aux process resource owner to use, and we should not + * be in a transaction that's installed some other resowner. + */ + Assert(AuxProcessResourceOwner != NULL); + Assert(CurrentResourceOwner == NULL || + CurrentResourceOwner == AuxProcessResourceOwner); + CurrentResourceOwner = AuxProcessResourceOwner; + + /* * Verify XLOG status looks valid. */ if (ControlFile->state < DB_SHUTDOWNED || @@ -8467,6 +8476,15 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch) void ShutdownXLOG(int code, Datum arg) { + /* + * We should have an aux process resource owner to use, and we should not + * be in a transaction that's installed some other resowner. + */ + Assert(AuxProcessResourceOwner != NULL); + Assert(CurrentResourceOwner == NULL || + CurrentResourceOwner == AuxProcessResourceOwner); + CurrentResourceOwner = AuxProcessResourceOwner; + /* Don't be chatty in standalone mode */ ereport(IsPostmasterEnvironment ? LOG : NOTICE, (errmsg("shutting down"))); diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c index 7e34bee..cdd71a9 100644 --- a/src/backend/bootstrap/bootstrap.c +++ b/src/backend/bootstrap/bootstrap.c @@ -403,6 +403,13 @@ AuxiliaryProcessMain(int argc, char *argv[]) /* finish settin
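For readers who haven't poked at resowner.c, the "standalone owner plus exit callback" pattern described above boils down to something like the sketch below. This is not the committed code (the real facility is CreateAuxProcessResourceOwner() and friends in resowner.c); the function names here are invented, but the APIs used (ResourceOwnerCreate, ResourceOwnerRelease, on_shmem_exit) are the existing ones.

/*
 * Sketch only: give an auxiliary process a standalone ResourceOwner and
 * make sure it is released (dropping any buffer pins) at process exit.
 */
#include "postgres.h"
#include "storage/ipc.h"
#include "utils/resowner.h"

static ResourceOwner MyAuxResourceOwner = NULL;

/* shmem-exit callback: run the usual three release phases. */
static void
ReleaseMyAuxResourceOwner(int code, Datum arg)
{
	ResourceOwnerRelease(MyAuxResourceOwner,
						 RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
	ResourceOwnerRelease(MyAuxResourceOwner,
						 RESOURCE_RELEASE_LOCKS, false, true);
	ResourceOwnerRelease(MyAuxResourceOwner,
						 RESOURCE_RELEASE_AFTER_LOCKS, false, true);
}

static void
CreateMyAuxResourceOwner(void)
{
	Assert(MyAuxResourceOwner == NULL);
	MyAuxResourceOwner = ResourceOwnerCreate(NULL, "aux process (sketch)");
	CurrentResourceOwner = MyAuxResourceOwner;
	on_shmem_exit(ReleaseMyAuxResourceOwner, (Datum) 0);
}

Because the release is registered as a shmem-exit callback, the process-exit path is covered without each auxiliary process having to remember to clean up, which is exactly the gap described in the mail above.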
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 07/16/2018 02:54 PM, Dean Rasheed wrote: > On 16 July 2018 at 13:23, Tomas Vondra wrote: The top-level clauses allow us to make such deductions, with deeper clauses it's much more difficult (perhaps impossible). Because for example with (a=1 AND b=1) there can be just a single match, so if we find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR b=2)) it's not that simple, because there may be multiple combinations and so a match in MCV does not guarantee anything. >>> >>> Actually, it guarantees a lower bound on the overall selectivity, and >>> maybe that's the best that we can do in the absence of any other >>> stats. >>> >> Hmmm, is that actually true? Let's consider a simple example, with two >> columns, each with just 2 values, and a "perfect" MCV list: >> >> a | b | frequency >>--- >> 1 | 1 | 0.5 >> 2 | 2 | 0.5 >> >> And let's estimate sel(a=1 & b=2). > > OK.In this case, there are no MCV matches, so there is no lower bound (it's > 0). > > What we could do though is also impose an upper bound, based on the > sum of non-matching MCV frequencies. In this case, the upper bound is > also 0, so we could actually say the resulting selectivity is 0. > Hmmm, it's not very clear to me how would we decide which of these cases applies, because in most cases we don't have MCV covering 100% rows. Anyways, I've been thinking about how to modify the code to wort the way you proposed (in a way sufficient for a PoC). But after struggling with it for a while it occurred to me it might be useful to do it on paper first, to verify how would it work on your examples. So let's use this data create table foo(a int, b int); insert into foo select 1,1 from generate_series(1,5); insert into foo select 1,2 from generate_series(1,4); insert into foo select 1,x/10 from generate_series(30,25) g(x); insert into foo select 2,1 from generate_series(1,3); insert into foo select 2,2 from generate_series(1,2); insert into foo select 2,x/10 from generate_series(30,50) g(x); insert into foo select 3,1 from generate_series(1,1); insert into foo select 3,2 from generate_series(1,5000); insert into foo select 3,x from generate_series(3,60) g(x); insert into foo select x,x/10 from generate_series(4,75) g(x); Assuming we have perfectly exact statistics, we have these MCV lists (both univariate and multivariate): select a, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by a order by 2 desc; a| count | frequency ++--- 3 | 614998 |0.2727 2 | 549971 |0.2439 1 | 339971 |0.1508 1014 | 1 |0. 57220 | 1 |0. ... select b, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by b order by 2 desc; b| count | frequency +---+--- 1 | 90010 |0.0399 2 | 65010 |0.0288 3 |31 |0. 7 |31 |0. ... select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by a, b order by 3 desc; a| b| count | frequency ++---+--- 1 | 1 | 5 |0.0222 1 | 2 | 4 |0.0177 2 | 1 | 3 |0.0133 2 | 2 | 2 |0.0089 3 | 1 | 1 |0.0044 3 | 2 | 5000 |0.0022 2 | 12445 |10 |0. ... And let's estimate the two queries with complex clauses, where the multivariate stats gave 2x overestimates. 
SELECT * FROM foo WHERE a=1 and (b=1 or b=2); -- actual 9, univariate: 24253, multivariate: 181091 univariate: sel(a=1) = 0.1508 sel(b=1) = 0.0399 sel(b=2) = 0.0288 sel(b=1 or b=2) = 0.0673 multivariate: sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770) The second multivariate estimate comes from assuming only the first 5 items make it to the multivariate MCV list (covering 6.87% of the data) and extrapolating the selectivity to the non-MCV data too. (Notice it's about 2x the actual selectivity, so this extrapolation due to not realizing the MCV already contains all the matches is pretty much responsible for the whole over-estimate). So, how would the proposed algorithm work? Let's start with "a=1": sel(a=1) = 0.1508 I don't see much point in applying the two "b" clauses independently (or how would it be done, as it's effectively a single clause): sel(b=1 or b=2) = 0.0673 And we get total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101 >From the multivariate MCV we get mcv_sel = 0.0399 And finally total_sel = Max(total_sel, mcv_sel) = 0.0399 Which is great, but I don't see how that actually helped anything? We still only have the estimate for the ~7% covered by the MCV list, and not the remaining non-MCV part. I could do the same thing for the second qu
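For what it's worth, the clamping Dean proposes can be written down in a few lines. The helper below is only a toy illustration of the bounds involved (matching MCV frequency as a lower bound, matching frequency plus the non-MCV remainder as an upper bound); it is not code from the patch, which works with per-item match bitmaps.

/*
 * Toy illustration: clamp an independence-based selectivity estimate using
 * a multivariate MCV list.  mcv_match_freq is the summed frequency of MCV
 * items matching the whole clause list, mcv_total_freq the summed frequency
 * of all MCV items.
 */
static double
clamp_with_mcv(double independent_sel,	/* product of per-clause selectivities */
			   double mcv_match_freq,
			   double mcv_total_freq)
{
	double		lower = mcv_match_freq;
	double		upper = mcv_match_freq + (1.0 - mcv_total_freq);

	if (independent_sel < lower)
		independent_sel = lower;
	if (independent_sel > upper)
		independent_sel = upper;

	return independent_sel;
}

In the "perfect MCV" example above (mcv_total_freq = 1.0 and no matching items) both bounds collapse to 0, which is how the upper bound recovers the right answer for sel(a=1 and b=2).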
Re: New GUC to sample log queries
On 07/16/2018 05:24 PM, Robert Haas wrote: > On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing > wrote: >> Hmm. Not sure if that last word should be _sample, _sampling, _rate, or >> a combination of those. > > +1 for rate or sample_rate. I think "sample" or "sampling" without > "rate" will not be very clear. > +1 to sample_rate. It's what auto_explain and pgbench use, so let's keep the naming consistent. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: AtEOXact_ApplyLauncher() and subtransactions
On Mon, Jul 16, 2018 at 2:36 AM, Amit Khandekar wrote: > 0001 patch contains the main fix. In this patch I have used some > naming conventions and some comments that you used in your patch, > plus, I used your method of lazily allocating new stack level. The > stack is initially Null. Committed and back-patched to v11 and v10. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: patch to allow disable of WAL recycling
There have been quite a few comments since last week, so at this point I am uncertain how to proceed with this change. I don't think I saw anything concrete in the recent emails that I can act upon. I would like to respond to the comment about trying to "self-tune" the behavior based on inferences made about caching during setup. I can't speak for many other filesystems, but for ZFS, the ARC size is not fixed and will vary based on the memory demands against the machine. Also, what files are cached will vary based upon the workloads running on the machine. Thus, I do not think there is a valid way to make inferences about future caching behavior based upon a point-in-time observation. I am still happy to update the man pages to explain the new tunable better if that is acceptable. Thanks, Jerry On Sun, Jul 15, 2018 at 6:32 PM, Robert Haas wrote: > On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund wrote: > > This is formulated *WAY* too positive. It'll have dramatic *NEGATIVE* > > performance impact of non COW filesystems, and very likely even negative > > impacts in a number of COWed scenarios (when there's enough memory to > > keep all WAL files in memory). > > > > I still think that fixing this another way would be preferrable. This'll > > be too much of a magic knob that depends on the fs, hardware and > > workload. > > I tend to agree with you, but unless we have a pretty good idea what > that other way would be, I think we should probably accept the patch. > > Could we somehow make this self-tuning? On any given > filesystem/hardware/workload, either creating a new 16MB file is > faster, or recycling an old file is faster. If the old file is still > cached, recycling it figures to win on almost any hardware. If not, > it seems like something of a toss-up. I suppose we could try to keep > a running average of how long it is taking us to recycle WAL files and > how long it is taking us to create new ones; if we do each one of > those things at least sometimes, then we'll eventually get an idea of > which one is quicker. But it's not clear to me that such data would > be very reliable unless we tried to make sure that we tried both > things fairly regularly under circumstances where we could have chosen > to do the other one. > > I think part of the problem here is that whether a WAL segment is > likely to be cached depends on a host of factors which we don't track > very carefully, like whether it's been streamed or decoded recently. > If we knew when that a particular WAL segment hadn't been accessed for > any purpose in 10+ minutes, it would probably be fairly safe to guess > that it's no longer in cache; if we knew that it had been accessed <15 > seconds ago, that it is probably still in cache. But we have no idea. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
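For what Robert sketches in the quoted mail, the self-tuning could in principle be as simple as keeping running averages of the two costs and preferring the cheaper path. The snippet below is purely illustrative (no such code exists in any posted patch), and Jerry's objection still applies: a point-in-time measurement may say little about future cache behaviour.

/*
 * Illustrative only: exponentially weighted moving averages of how long
 * recycling an old WAL segment vs. creating a new one has been taking,
 * used to pick the (currently) cheaper option.
 */
#include <stdbool.h>

#define EWMA_ALPHA 0.2

static double avg_recycle_us = 0.0;
static double avg_create_us = 0.0;

static void
record_timing(double *avg, double elapsed_us)
{
	if (*avg == 0.0)
		*avg = elapsed_us;		/* first sample */
	else
		*avg = EWMA_ALPHA * elapsed_us + (1.0 - EWMA_ALPHA) * (*avg);
}

static bool
prefer_recycling(void)
{
	/* Until both paths have been sampled, keep the historical default. */
	if (avg_recycle_us == 0.0 || avg_create_us == 0.0)
		return true;
	return avg_recycle_us <= avg_create_us;
}

As Robert notes, such averages are only trustworthy if both paths are exercised regularly, which is itself a policy decision.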
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On 2018-Jul-16, Ashutosh Bapat wrote: > > Hmm, let me reword this comment completely. How about the attached? > That looks much better. However it took me a small while to understand > that (1), (2) and (3) correspond to strategies. You're right. Amended again and pushed. I also marked the open item closed. Thanks everyone -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Jul 16, 2018 at 09:41:51PM +0300, Heikki Linnakangas wrote: > On 16 July 2018 21:38:39 EEST, Robert Haas wrote: >>On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas >>wrote: >>> Doesn't have to be a trigger, could be a CHECK constraint, datatype >>input >>> function, etc. Admittedly, having a datatype input function that >>inserts to >>> the table is worth a "huh?", but I'm feeling very confident that we >>can >>> catch all such cases, and some of them might even be sensible. >> >>Is this sentence missing a "not"? i.e. "I'm not feeling very >>confident"? > > Yes, sorry. This explains a lot :p I doubt that we'd be able to catch all the holes, as the conditions under which the optimization can run safely are basically impossible to identify beforehand. I'd like to vote for getting rid of this optimization for COPY, as it can hurt more than it helps. Per the lack of complaints, this could happen only in HEAD? -- Michael
Re: Fix error message when trying to alter statistics on included column
On 2018-Jun-28, Yugo Nagata wrote: > According to the error message, it is not allowed to alter statistics on > an included column because it is a "non-expression column". > > postgres=# create table test (i int, d int); > CREATE TABLE > postgres=# create index idx on test(i) include (d); > CREATE INDEX > postgres=# alter index idx alter column 2 set statistics 10; > ERROR: cannot alter statistics on non-expression column "d" of index "idx" > HINT: Alter statistics on table column instead. > > However, I think this should be forbidden in that this is not a key column > but an included column. Even if we decide to support expressions in included > columns in the future, it is meaningless to do this because any statistics on > an included column are never used by the planner. I agree with this reasoning, so I pushed this patch. Thanks! I added a couple of lines in the regress file for this feature also. Teodor, Alexander, now would be the time to express dissent :-) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
On 16 July 2018 at 06:55, Tom Lane wrote: > I started to look at this patch. I think this is basically the right > direction to go in, but I'm not terribly happy with the details of the > data structure design. I've made an attempt at addressing the issues that I understood. I've not done anything about your Bitmapset for non-matching partitions. I fail to see how that would improve the code. But please feel free to provide details of your design and I'll review it and let you know what I think about it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services [Attachment: v2-0001-Fix-run-time-partition-pruning-for-UNION-ALL-pare.patch]
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On 2018/07/17 8:17, Alvaro Herrera wrote: > On 2018-Jul-16, Ashutosh Bapat wrote: > >>> Hmm, let me reword this comment completely. How about the attached? > >> That looks much better. However it took me a small while to understand >> that (1), (2) and (3) correspond to strategies. > > You're right. Amended again and pushed. I also marked the open item > closed. > > Thanks everyone Thank you Ashutosh for further review and Alvaro for improving it quite a bit before committing. Thanks, Amit
Re: patch to allow disable of WAL recycling
On Mon, Jul 16, 2018 at 10:38:14AM -0400, Robert Haas wrote: > It's been a few years since I tested this, but my recollection is that > if you fill up pg_xlog, the system will PANIC and die on a vanilla > Linux install. Sure, you can set max_wal_size, but that's a soft > limit, not a hard limit, and if you generate WAL faster than the > system can checkpoint, you can overrun that value and force allocation > of additional WAL files. So I'm not sure we have any working > ENOSPC-panic protection today. Given that, I'm doubtful that we > should prioritize maintaining whatever partially-working protection we > may have today over raw performance. If we want to fix ENOSPC on > pg_wal = PANIC, and I think that would be a good thing to fix, then we > should do it either by finding a way to make the WAL insertion ERROR > out instead of panicking, or throttle WAL generation as we get close > to disk space exhaustion so that the checkpoint has time to complete, > as previously proposed by Heroku. I would personally prefer seeing max_wal_size switched to a hard limit, and making that tunable. I am wondering if that's the case for other people on this list, but I see from time to time, every couple of weeks, people complaining that Postgres is not able to maintain a hard guarantee behind the value of max_wal_size. In some upgrade scenarios, I had to tell such folks to throttle their insert load and also manually issue checkpoints to allow the system to stay up and continue with the upgrade process. So there are definitely cases where throttling is useful, and if the hard limit is reached I would rather see WAL generation from other backends simply stopped so that the system can finish its checkpoint, instead of risking the system going down. And sometimes this happens also with a SQL dump, where throttling the load at the application level means a more complex dump strategy, for example splitting things between multiple files. -- Michael
Re: PATCH: psql tab completion for SELECT
On 17 July 2018 at 00:00, Heikki Linnakangas wrote: > Playing around with this a little bit, I'm not very satisfied with the > completions. Sometimes this completes too much, and sometimes too little. > All of this has been mentioned in this and the other thread [1] already, > this just my opinion on all this. Hi Heikki, thanks for getting this thread active again. I do actually have another patch, which I didn't post yet because everyone was busy with v11, and I wanted to wait for more feedback about inclusions/exclusions. Holding it back now seems like a mistake (especially since the last patch had a bug) so here it is. There's also a patch, to be applied on top, that adds completion after commas. > Too much: > > postgres=# \d lp >Table "public.lp" >Column | Type | Collation | Nullable | Default > --+-+---+--+- > id | integer | | | > partkey | text| | | > partcol1 | text| | | > partcol2 | text| | | > Partition key: LIST (partkey) > Number of partitions: 1000 (Use \d+ to list them.) > > postgres=# select pa[TAB] > pad_attribute parameter_default parameter_stylepartattrs partcol2 > partexprs partrelid > page parameter_mode parameter_typespartclass > partcollation partkeypartstrat > pageno parameter_name parse_ident( partcol1 partdefid > partnatts passwd I agree that there is too much. I don't know what the best way to reduce it is, short of specifically excluding certain things. In theory, I think the query could be sensitive to how much text is entered, for example, let pa[TAB] not show partrelid and other columns from pg_partition, but part[TAB] show them; the general rule could be "only show attributes for system tables if 3 or more letters entered". But I'm wary of overcomplicating a patch that is (IMHO) a useful improvement even if simple. > Too little: > > postgres=# select partkey, p[TAB] > [no completions] > > > Then there's the multi-column case, which seems weird (to a user - I know > the implementation and understand why): > > postgres=# select partkey, partc[TAB] > [no completions] Yep, there was no completion after commas in the previous patches. > And I'd love this case, where go back to edit the SELECT list, after already > typing the FROM part, to be smarter: > > postgres=# select p[TAB] FROM lp; > Display all 370 possibilities? (y or n) I don't know enough about readline to estimate how easy this would be. I think all our current completions use only the text up to the cursor. > There's something weird going on with system columns, from a user's point of > view: > > SELECT oi[TAB] > oid oidvectortypes( > > postgres=# select xm[TAB] > xmin xmlcomment( xml_is_well_formed_content( > xmlvalidate( > xmlagg( xml_is_well_formed( > xml_is_well_formed_document( > > So oid and xmin are completed. But "xmax" and "ctid" are not. I think this > is because in fact none of the system columns are completed, but there > happen to be some tables with regular columns named "oid" and "xmin". So it > makes sense once you know that, but it's pretty confusing to a user. > Tab-completion is a great way for a user to explore what's available, so > it's weird that some system columns are effectively completed while others > are not. You are correct, system columns are excluded, but of course there are always the same column names in other tables. Do you think we should just include all columns? > I'd like to not include set-returning functions from the list. 
Although you > can do "SELECT generate_series(1,10)", I'd like to discourage people from > doing that, since using set-returning functions in the target list is a > PostgreSQL extension to the SQL standard, and IMHO the "SELECT * FROM > generate_series(1,10)" syntax is easier to understand and works more sanely > with joins etc. Conversely, it would be really nice to include set-returning > function in the completions after FROM. I'm happy to exclude SRFs after SELECT if others agree. Including them after FROM seems worthwhile, but best left for a different patch? > I understand that there isn't much you can do about some of those things, > short of adding a ton of new context information about previous queries and > heuristics. I think the completion needs to be smarter to be acceptable. I > don't know what exactly to do, but perhaps someone else has ideas. > > I'm also worried about performance, especially of the query to get all the > column names. It's pretty much guaranteed to do perform a > SeqScan+Sort+Unique on pg_attribute. pg_attribute can be very large. In my > little test database, with the above 1000-partition table, hitting tab after > "SELECT " takes about 1 second, until you get the "Display all 1000 > possibili
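For reference, the column-name completion in this patch boils down to a catalog query of roughly the following shape, written here as a tab-complete.c-style C string constant. It is illustrative only (psql additionally filters on the text typed so far, and the posted patch's conditions differ in detail), but it shows where the pg_attribute SeqScan + Sort + Unique that Heikki measured comes from.

/*
 * Approximate shape of the column-name completion query: distinct column
 * names of visible relations, skipping dropped and system columns.
 */
static const char *const hypothetical_column_query =
	"SELECT DISTINCT a.attname "
	"  FROM pg_catalog.pg_attribute a "
	"  JOIN pg_catalog.pg_class c ON c.oid = a.attrelid "
	" WHERE a.attnum > 0 "
	"   AND NOT a.attisdropped "
	"   AND c.relkind IN ('r', 'v', 'm', 'f', 'p') "
	"   AND pg_catalog.pg_table_is_visible(c.oid)";

A DISTINCT over all of pg_attribute ends up as the SeqScan+Sort+Unique Heikki describes, which is why the 1000-partition case pays about a second on the first tab press.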
Re: PATCH: psql tab completion for SELECT
On 17 July 2018 at 03:27, Joao De Almeida Pereira wrote: > After playing a little bit around with the patch I noticed that a comma was > missing in line 1214 > + 1202 /* min_server_version */ > + 1203 9, > + 1204 /* catname */ > + 1205 "pg_catalog.pg_proc p", > + 1206 /* selcondition */ > + 1207 "p.prorettype NOT IN ('trigger'::regtype, > 'internal'::regtype) " > + 1208 "AND 'internal'::regtype != ALL (p.proargtypes) " > + 1209 "AND p.oid NOT IN (SELECT > unnest(array[typinput,typoutput,typreceive,typsend,typmodin,typmodout,typanalyze]) > FROM pg_type) " > + 1210 "AND p.oid NOT IN (SELECT > unnest(array[aggtransfn,aggfinalfn]) FROM pg_aggregate) " > + 1211 "AND p.oid NOT IN (SELECT > unnest(array[oprcode,oprrest,oprjoin]) FROM pg_operator) " > + 1212 "AND p.oid NOT IN (SELECT > unnest(array[lanplcallfoid,laninline,lanvalidator]) FROM pg_language) " > + 1213 "AND p.oid NOT IN (SELECT castfunc FROM pg_cast) " > + 1214 "AND p.oid NOT IN (SELECT amproc FROM pg_amproc) " > + 1215 /* viscondition */ > + 1216 "pg_catalog.pg_function_is_visible(p.oid)", > > To catch these typos it would be good if we could get some testing around psql. > (Newbie question: do we have any kind of testing around tools like psql?) > > Thanks > Joao Hi Joao, Ah yes, the embarrassing missing comma. I kind of wish my IDE or compiler had pointed it out to me, but how was it to know that I was foolishly combining two struct fields? Sadly, I think the best defence right now is to have others scrutinise the code. I attached a new version of the patch in reply to Heikki. Would you care to take a careful look for me? I think some automated test framework for the tab completion queries is possible, by calling psql_completion and examining the results. Maybe someone will volunteer... Edmund
Re: How to make partitioning scale better for larger numbers of partitions
On 2018/07/13 22:10, Ashutosh Bapat wrote: > On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho wrote: >>> I wondered if you compared to PG10 or to inheritence-partitioning (parent >>> with relkind='r' and either trigger or rule or >INSERT/UPDATE directly into >>> child) ? >> >> Thank you for your reply. >> >> I compared to PG11beta2 with non-partitioned table. >> >> Non-partitioned table has 1100 records in one table. >> Partitioned table has one record on each leaf partitions. >> > > I don't think partitioning should be employed this way even for the > sake of comparison. Depending upon the size of each tuple, 1100 tuples > are inserted into a single table, they will probably occupy few > hundred pages. In a partitioned table with one tuple per partition > they will occupy 1100 pages at least. There is other space, locking > overheads to maintain 1100 tables. I think the right way to compare is > to have really large that which really requires 1100 partitions and > then compare performance by putting that data in 1100 partitions and > in an unpartitioned table. Even with that kind of data, you will see > some difference in performance, but that won't be as dramatic as you > report. > > I might be missing something though. Perhaps, Kato-san only intended to report that the time that planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table. It is and has been clear that that's because the planning time explodes as the number of partitions increases. If there's lots of data in it, then the result will look completely different as you say, because scanning a single partition (of the 1100 total) will spend less time than scanning a non-partitioned table containing 1100 partitions worth of data. But the planning time would still be more for the partitioned table, which seems to be the point of this benchmark. Thanks, Amit
Re: Make foo=null a warning by default.
On Mon, Jul 16, 2018 at 11:37:28AM -0400, Tom Lane wrote: > Heikki Linnakangas writes: > > On 16/07/18 18:10, Tom Lane wrote: > >> TBH I'm not really excited about investing any work in this area at all. > >> Considering how seldom we hear any questions about transform_null_equals > >> anymore[1], I'm wondering if we couldn't just rip the "feature" out > >> entirely. > > > Yeah, I was wondering about that too. But Fabien brought up a completely > > new use-case for this: people learning SQL. For beginners who don't > > understand the behavior of NULLs yet, I can see a warning or error being > > useful training wheels. Perhaps a completely new "training_wheels=on" > > option, which could check may for many other beginner errors, too, would > > be better for that. > > Agreed --- but what we'd want that to do seems only vaguely related to > the existing behavior of transform_null_equals. As an example, we > intentionally made transform_null_equals *not* trigger on > > CASE x WHEN NULL THEN ... > > but a training-wheels warning for that would likely be reasonable. > > For that matter, many of the old threads about this are complaining > about nulls that aren't simple literals in the first place. I wonder > whether a training-wheels feature that whined *at runtime* about null > WHERE-qual or case-test results would be more useful than a parser > check. Am I understanding correctly that this would be looking for bogus conditions in the plan tree? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
RE: How to make partitioning scale better for larger numbers of partitions
On 2018/07/17 10:49, Amit Langote wrote: >Perhaps, Kato-san only intended to report that the time that planner spends >for a partitioned table with 1100 partitions is just too high compared to the >time it spends on a non-partitioned table. yes, It is included for the purposes of this comparison. The purpose of this comparison is to find where the partitioning bottleneck is. Using the bottleneck as a hint of improvement, I'd like to bring the performance of partitioned table closer to the performance of unpartitioned table as much as possible. -Original Message- From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] Sent: Tuesday, July 17, 2018 10:49 AM To: Ashutosh Bapat ; Kato, Sho/加藤 翔 Cc: Justin Pryzby ; PostgreSQL mailing lists Subject: Re: How to make partitioning scale better for larger numbers of partitions On 2018/07/13 22:10, Ashutosh Bapat wrote: > On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho wrote: >>> I wondered if you compared to PG10 or to inheritence-partitioning (parent >>> with relkind='r' and either trigger or rule or >INSERT/UPDATE directly into >>> child) ? >> >> Thank you for your reply. >> >> I compared to PG11beta2 with non-partitioned table. >> >> Non-partitioned table has 1100 records in one table. >> Partitioned table has one record on each leaf partitions. >> > > I don't think partitioning should be employed this way even for the > sake of comparison. Depending upon the size of each tuple, 1100 tuples > are inserted into a single table, they will probably occupy few > hundred pages. In a partitioned table with one tuple per partition > they will occupy 1100 pages at least. There is other space, locking > overheads to maintain 1100 tables. I think the right way to compare is > to have really large that which really requires 1100 partitions and > then compare performance by putting that data in 1100 partitions and > in an unpartitioned table. Even with that kind of data, you will see > some difference in performance, but that won't be as dramatic as you > report. > > I might be missing something though. Perhaps, Kato-san only intended to report that the time that planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table. It is and has been clear that that's because the planning time explodes as the number of partitions increases. If there's lots of data in it, then the result will look completely different as you say, because scanning a single partition (of the 1100 total) will spend less time than scanning a non-partitioned table containing 1100 partitions worth of data. But the planning time would still be more for the partitioned table, which seems to be the point of this benchmark. Thanks, Amit
Re: How to make partitioning scale better for larger numbers of partitions
On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho wrote: > On 2018/07/17 10:49, Amit Langote wrote: >>Perhaps, Kato-san only intended to report that the time that planner spends >>for a partitioned table with 1100 partitions is just too high compared to the >>time it spends on a non-partitioned table. > > yes, It is included for the purposes of this comparison. > > The purpose of this comparison is to find where the partitioning bottleneck > is. > Using the bottleneck as a hint of improvement, I'd like to bring the > performance of partitioned table closer to the performance of unpartitioned > table as much as possible. > That's a good thing, but may not turn out to be realistic. We should compare performance where partitioning matters and try to improve in those contexts. Else we might improve performance in scenarios which are never used. In this case, even if we improve the planning time by 100%, it hardly matters since planning time is negligible compared to the execution time because of the huge data volumes for which partitioning is useful. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Let's remove DSM_IMPL_NONE.
Hello. At Tue, 10 Jul 2018 18:52:54 +0200, Peter Eisentraut wrote in > committed Thank you for committing this! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Refactor documentation for wait events (Was: pgsql: Add wait event for fsync of WAL segments)
On Mon, Jul 16, 2018 at 11:22:07AM -0400, Robert Haas wrote: > This doesn't seem to get rid of the morerows stuff. The troubling ones are in monitoring.sgml: LWLock, Lock, Activity, Client, IPC, Timeout, and IO. And the patch previously sent removes them, but perhaps I am missing your point? -- Michael
Re: How to make partitioning scale better for larger numbers of partitions
On 2018/07/17 12:14, Ashutosh Bapat wrote: > On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho wrote: >> On 2018/07/17 10:49, Amit Langote wrote: >>> Perhaps, Kato-san only intended to report that the time that planner spends >>> for a partitioned table with 1100 partitions is just too high compared to >>> the time it spends on a non-partitioned table. >> >> yes, It is included for the purposes of this comparison. >> >> The purpose of this comparison is to find where the partitioning bottleneck >> is. >> Using the bottleneck as a hint of improvement, I'd like to bring the >> performance of partitioned table closer to the performance of unpartitioned >> table as much as possible. >> > > That's a good thing, but may not turn out to be realistic. We should > compare performance where partitioning matters and try to improve in > those contexts. Else we might improve performance in scenarios which > are never used. > > In this case, even if we improve the planning time by 100%, it hardly > matters since planning time is neglegible compared to the execution > time because of huge data where partitioning is useful. While I agree that it's a good idea to tell users to use partitioning only if the overhead of having the partitioning in the first place is bearable, especially the planner overhead, this benchmark shows us that even for what I assume might be fairly commonly occurring queries (select .. from / update .. / delete from partitioned_table where partkey = ?), planner spends way too many redundant cycles. Some amount of that overhead will always remain and planning with partitioning will always be a bit slower than without partitioning, it's *too* slow right now for non-trivial number of partitions. Thanks, Amit
Re: [HACKERS] Restricting maximum keep segments by repslots
On Fri, Jul 13, 2018 at 5:40 PM, Kyotaro HORIGUCHI wrote: > Hello. > > At Wed, 11 Jul 2018 15:09:23 +0900, Masahiko Sawada > wrote in >> On Mon, Jul 9, 2018 at 2:47 PM, Kyotaro HORIGUCHI >> wrote: > .. >> Here is review comments of v4 patches. >> >> + if (minKeepLSN) >> + { >> + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); >> + Assert(!XLogRecPtrIsInvalid(slotPtr)); >> + >> + tailSeg = GetOldestKeepSegment(currpos, slotPtr); >> + >> + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, >> wal_segment_size); >> + } >> >> The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the >> destination at 4th argument the wal_segment_size will be changed in >> the above expression. The regression tests by PostgreSQL Patch Tester > > I'm not sure I get this correctly, the definition of the macro is > as follows. > > | #define XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes) \ > | (dest) = (segno) * (wal_segsz_bytes) + (offset) > > The destination is the *3rd* parameter and the forth is segment > size which is not to be written. Please see commit a22445ff0b which flipped input and output arguments. So maybe you need to rebase the patches to current HEAD. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
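To spell out the argument-order change being referred to: the destination moved from the third position to the fourth. The "after" form below is my reading of commit a22445ff0b and should be treated as an assumption to verify against current HEAD.

/*
 * Before a22445ff0b the destination was the 3rd argument:
 *     XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes)
 * After it the destination is the 4th argument (assumed; please verify):
 *     XLogSegNoOffsetToRecPtr(segno, offset, wal_segsz_bytes, dest)
 *
 * So the call quoted above would need to become something like:
 */
XLogSegNoOffsetToRecPtr(tailSeg, 0, wal_segment_size, *minKeepLSN);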
Re: pgsql: Allow UNIQUE indexes on partitioned tables
On 2018-Feb-19, David G. Johnston wrote: > As an aside, adding a link to "Data Definition/Table Partitioning" from at > least CREATE TABLE ... PARTITION BY; and swapping "PARTITION BY" and > "PARTITION OF" in the Parameters section of that page - one must partition > by a table before one can partition it (and also the synopsis lists them in > the BY before OF order), would be helpful. A little late, I have done these two changes. -- Álvaro Herrera    https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Allow UNIQUE indexes on partitioned tables
On 2018/07/17 13:57, Alvaro Herrera wrote: > On 2018-Feb-19, David G. Johnston wrote: > >> As an aside, adding a link to "Data Definition/Table Partitioning" from at >> least CREATE TABLE ... PARTITION BY; and swapping "PARTITION BY" and >> "PARTITION OF" in the Parameters section of that page - one must partition >> by a table before one can partition it (and also the synopsis lists them in >> the BY before OF order), would be helpful. > > A little late, I have done these two changes. A belated +1. Thanks, Amit
untrusted PLs should be GRANTable
Hi all, A user has raised the point that our refusal to GRANT rights to untrusted PLs is counterproductive and inconsistent with how we behave elsewhere. Yes, untrusted PLs can be escaped to gain superuser rights, often trivially. But we allow this: CREATE ROLE superme SUPERUSER NOINHERIT; GRANT superme TO me; and really, GRANTing an untrusted PL is similar. Forcing users to create their PLs as a superuser increases the routine use of superuser accounts. Most users' DDL deploy scripts will get run as a superuser if they have to use a superuser for PL changes; they're not going to SET ROLE and RESET ROLE around the function changes. It also encourages users to make their untrusted functions SECURITY DEFINER while still owned by a superuser, which we really don't want them doing unnecessarily. In the name of making things more secure, we've made them less secure. Untrusted PLs should be GRANTable, with a NOTICE or WARNING telling the admin that GRANTing an untrusted PL effectively gives the user the ability to escape to superuser. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: How to make partitioning scale better for larger numbers of partitions
On 2018/07/16 13:16, Tsunakawa-san wrote: >Thanks. And what does the comparison look like between the unpartitioned case >and various partition counts? What's the performance characteristics in terms >of the latency and partition count? I thought that's what you tried to >reveal first?

For the unpartitioned table, the latency of SELECT/UPDATE statements is close to O(n), where n is the number of records, and the latency of INSERT statements is close to O(1). For the partitioned table, up to 400 partitions the latency of SELECT/INSERT statements is close to O(log n), where n is the number of partitions; between 400 and 6400 partitions it is close to O(n). Up to 400 partitions the latency of UPDATE statements is close to O(n); between 400 and 6400 partitions it seems to be O(n^2). Details are as follows.

unpartitioned table result (prepared mode)
------------------------------------------
 scale | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
-------+-------------+-------------+----------------+----------------+----------------
   100 |       0.24  | 4174.395738 |          0.056 |          0.051 |          0.04
   200 |       0.258 | 3873.099014 |          0.065 |          0.059 |          0.04
   400 |       0.29  | 3453.171112 |          0.081 |          0.072 |          0.041
   800 |       0.355 | 2814.936942 |          0.112 |          0.105 |          0.041
  1600 |       0.493 | 2027.702689 |          0.18  |          0.174 |          0.042
  3200 |       0.761 | 1313.532458 |          0.314 |          0.307 |          0.043
  6400 |       1.214 |  824.001431 |          0.54  |          0.531 |          0.043

partitioned table result (prepared mode)
----------------------------------------
 num_part | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 |       0.937 | 1067.473258 |          0.557 |          0.087 |          0.123
      200 |       1.65  |  606.244552 |          1.115 |          0.121 |          0.188
      400 |       3.295 |  303.491681 |          2.446 |          0.19  |          0.322
      800 |       8.102 |  123.422456 |          6.553 |          0.337 |          0.637
     1600 |      36.996 |   27.030027 |         31.528 |          1.621 |          2.455
     3200 |     147.998 |    6.756922 |        136.136 |          4.21  |          4.94
     6400 |     666.947 |    1.499383 |        640.879 |          7.631 |          9.642

regards,

-----Original Message----- From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] Sent: Monday, July 16, 2018 1:16 PM To: Kato, Sho/加藤 翔 Cc: 'Amit Langote' ; PostgreSQL mailing lists Subject: RE: How to make partitioning scale better for larger numbers of partitions From: Kato, Sho [mailto:kato-...@jp.fujitsu.com] > I did pgbench -M prepared and perf record. > > UPDATE latency in prepared mode is 95% shorter than in simple mode. > SELECT latency in prepared mode is 54% shorter than in simple mode. > INSERT latency in prepared mode is 8% shorter than in simple mode. Thanks. And what does the comparison look like between the unpartitioned case and various partition counts? What's the performance characteristics in terms of the latency and partition count? I thought that's what you tried to reveal first? Regards Takayuki Tsunakawa
Re: Add SKIP LOCKED to VACUUM and ANALYZE
On Thu, Jul 12, 2018 at 02:37:28PM +0900, Michael Paquier wrote: > On Wed, Jun 13, 2018 at 08:29:12PM +, Bossart, Nathan wrote: >> Previous thread: >> https://postgr.es/m/4BC0F3CD-F4B5-4F23-AADB-80607F9E4B4E%40amazon.com >> >> This is a new thread for tracking the work to add SKIP LOCKED to >> VACUUM and ANALYZE. I've attached a rebased set of patches. > > I am beginning to look at this stuff, and committed patch 4 and 5 on the > way as they make sense independently. I am still reviewing the rest, > which applies with some fuzz, and the tests proposed still pass. I have been looking at the rest of the patch set, and I find the way SKIP_LOCKED is handled a bit inconsistent. First, if the manual VACUUM does not specify a list of relations, which can never happen for autovacuum, then we use get_all_vacuum_rels to decide the list of relations to look at, and it is up to vacuum_rel() to decide whether a relation can be skipped. If a list of relations is given by the caller, then it is up to expand_vacuum_rel() to do the call. In expand_vacuum_rel(), an OID could be defined for a relation, which is something used by autovacuum. If no OID is defined, which happens for a manual VACUUM, then we use RangeVarGetRelidExtended() directly at this stage and see if the relation is available. If the relation listed in the manual VACUUM is a partitioned table, then its full set of partition OIDs (including lower layers) is included in the list of relations to vacuum, and we definitely want to take, and then release, AccessShareLock so that the partition tree lookup can happen. This also uses find_all_inheritors() with NoLock, relying on vacuum_rel() to decide whether to skip each relation. The first thing which strikes me is that we may actually *not* want to check for lock skipping within expand_vacuum_rel(), as that is mainly a function aimed at building the list of relations which are going to be vacuumed, and all the lock skipping operations should happen at the beginning of vacuum_rel(). This way, even if the parent of a partition tree is listed but cannot have its RangeVar opened immediately, we still have a chance to vacuum some of its partitions after listing them. This point is debatable as there are pros and cons. I would be of the opinion not to change expand_vacuum_rel()'s mission to build the list of relations to VACUUM, and to actually complain about a lock not being available when the relation is ready to be processed individually. It is also perfectly possible to skip locks for partitions by listing them individually in the VACUUM command used. I am pretty sure that Andres would complain at the sight of this paragraph as this is backwards with the reason behind the refactoring of RangeVarGetRelidExtended, but I like the new API better :p For this part, it seems to me that we can do better than what is in HEAD (a rough sketch follows at the end of this message): - Call RangeVarGetRelidExtended without lock. - Check for has_subclass(), which should be extended so that it does not complain about a missing relation, leaving the error handling to vacuum.c. - Call RangeVarGetRelidExtended a second time if a lookup with find_all_inheritors is necessary. If the relation goes missing in between, then we just get an error, as with the current behavior. I am also questioning whether it is worth complicating vac_open_indexes() by skipping a full relation vacuum if one of its indexes cannot be opened, which would be the case for an ALTER INDEX for example.
It seems to me that skipping locks should just happen for the parent relation, so I would not bother much about this case, and just document the behavior. If somebody argues for this option, we could always add a SKIP_INDEXES or such. "SKIP LOCKED" should be named "SKIP_LOCKED", with an underscore, for consistency with the other options.

-void
-cluster_rel(Oid tableOid, Oid indexOid, bool recheck, bool verbose)
+bool
+cluster_rel(Oid tableOid, Oid indexOid, bool recheck, bool verbose,
+            bool skip_locked)
 {

Some refactoring of CLUSTER on the way here? It would be nice to avoid having three boolean arguments here, and it would open the door for an extended CLUSTER grammar with parentheses as well as a SKIP_LOCKED option for it. And this could be refactored first. VACUUM FULL being a variant of CLUSTER, we could just reuse the same options... Now, using separate option names could be cleaner. The logic in get_elevel_for_skipped_relation could be refactored into something used by cluster_rel as well, as the same logic is used in three places (vacuum_rel, analyze_rel and cluster_rel with try_relation_open). I think that it would be most interesting to get the refactoring around get_elevel_for_skip_locked and cluster_rel done first. The improvement for RangeVarGetRelidExtended with relations not having subclasses could be worth doing separately as well. -- Michael
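For illustration only, here is a hypothetical sketch of the two-step lookup proposed above for expand_vacuum_rel(): resolve the name without taking a lock, and only lock and expand the partition tree when the relation actually has children, leaving it to vacuum_rel() to complain about (or skip) unavailable locks later. The function name, structure and omitted error handling are invented for this example; RangeVarGetRelidExtended(), has_subclass(), find_all_inheritors() and UnlockRelationOid() are existing backend functions. This is not the patch's code, just one way the flow could look.

#include "postgres.h"

#include "catalog/namespace.h"
#include "catalog/pg_inherits.h"
#include "nodes/pg_list.h"
#include "storage/lmgr.h"
#include "storage/lockdefs.h"

/*
 * Hypothetical sketch: resolve the relation name without a lock, and only
 * lock and walk the partition tree when the relation has children.
 */
static List *
expand_one_vacuum_rel_sketch(RangeVar *rv)
{
	List	   *vacrels;
	Oid			relid;

	/* First lookup: no lock taken, just resolve the name if possible. */
	relid = RangeVarGetRelidExtended(rv, NoLock, RVR_MISSING_OK, NULL, NULL);
	if (!OidIsValid(relid))
		return NIL;				/* let the caller handle the error */

	/* No children: nothing more to expand. */
	if (!has_subclass(relid))
		return list_make1_oid(relid);

	/*
	 * Second lookup, this time holding AccessShareLock long enough to walk
	 * the inheritance tree, released right after.  If the relation went
	 * away in between, we simply get the usual error.
	 */
	relid = RangeVarGetRelidExtended(rv, AccessShareLock, 0, NULL, NULL);
	vacrels = find_all_inheritors(relid, NoLock, NULL);
	UnlockRelationOid(relid, AccessShareLock);

	return vacrels;
}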
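And a similarly hypothetical sketch of the cluster_rel() refactoring hinted at above: folding the boolean arguments into a flags bitmask, so that adding SKIP_LOCKED (or a future CLUSTER option) does not keep growing the parameter list. The CLUOPT_* names, the prototype and the helper below are made up for this example and are not the committed API.

#include "postgres.h"

/* Illustrative option bits; a real patch may choose different names. */
#define CLUOPT_RECHECK		0x01	/* recheck relation state before work */
#define CLUOPT_VERBOSE		0x02	/* emit progress messages */
#define CLUOPT_SKIP_LOCKED	0x04	/* silently skip if lock unavailable */

/* Hypothetical signature; returns false when the relation was skipped. */
extern bool cluster_rel(Oid tableOid, Oid indexOid, int options);

/* How a VACUUM FULL caller might build its flags. */
static void
vacuum_full_one_rel_sketch(Oid tableOid, bool verbose, bool skip_locked)
{
	int			options = CLUOPT_RECHECK;

	if (verbose)
		options |= CLUOPT_VERBOSE;
	if (skip_locked)
		options |= CLUOPT_SKIP_LOCKED;

	(void) cluster_rel(tableOid, InvalidOid, options);
}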
Re: missing toast table for pg_policy
On Sat, Jul 14, 2018 at 03:47:38PM +0700, John Naylor wrote: > I hope you don't mind, but since it might be tedious to piece together > the addenda I left behind in this thread, I thought it might be useful > to update Joe's patch. The attached was rebased over the new > regression test, passes the pg_upgrade test, and has a draft commit > message. I was just having a second look at this patch, and ran a few more tests with pg_upgrade, which passed.

+-- 2. pg_largeobject and pg_largeobject_metadata, to avoid problems
+-- with pg_upgrade

John, what's actually the failure that was seen here? It would be nice to see this patch committed, but the reason here should be more explicit about why this cannot happen. -- Michael
Re: file cloning in pg_upgrade and CREATE DATABASE
On Wed, Jun 06, 2018 at 11:58:14AM -0400, Peter Eisentraut wrote: > > The setting "always" requires the use of reflinks. If > they are not supported, the pg_upgrade run > will abort. Use this in production to limit the upgrade run time. > The setting "auto" uses reflinks when available, > otherwise it falls back to a normal copy. This is the default. The > setting "never" prevents use of reflinks and always > uses a normal copy. This can be useful to ensure that the upgraded > cluster has its disk space fully allocated and not shared with the old > cluster. > Hm... I am wondering if we actually want the "auto" mode, where we make the option smarter automatically. I am afraid of users trying to use it and being surprised that there is no gain when they expected some. I would rather switch that to an on/off switch which defaults to "off", or simply keep what is available now. huge_pages=try was a bad idea as the result is not deterministic, so I would not have more of that... Putting CloneFile and check_reflink in a location that other frontend binaries, like pg_rewind, could use would be nice. -- Michael
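For context on what a reusable CloneFile()/check_reflink() helper boils down to on Linux, here is a minimal standalone sketch using the FICLONE ioctl. This is illustrative only and not taken from the patch: macOS/APFS would instead use copyfile() with COPYFILE_CLONE, and a real helper would fall back to a normal copy rather than bailing out when reflinks are unsupported.

/*
 * Minimal standalone sketch of a reflink-style copy on Linux: the
 * destination file ends up sharing the source's blocks copy-on-write,
 * which is what makes cloning nearly instantaneous.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>			/* FICLONE */

int
main(int argc, char **argv)
{
	int			srcfd;
	int			dstfd;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s source destination\n", argv[0]);
		return 1;
	}

	srcfd = open(argv[1], O_RDONLY);
	dstfd = open(argv[2], O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (srcfd < 0 || dstfd < 0)
	{
		perror("open");
		return 1;
	}

	/* Ask the filesystem to share blocks; fails if reflinks are unsupported. */
	if (ioctl(dstfd, FICLONE, srcfd) < 0)
	{
		perror("ioctl(FICLONE)");	/* a real tool would fall back to copying */
		return 1;
	}

	close(srcfd);
	close(dstfd);
	return 0;
}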