speed up unicode normalization quick check
Hi,

Attached is a patch to use perfect hashing to speed up Unicode normalization quick check.

0001 changes the set of multipliers attempted when generating the hash function. The set in HEAD works for the current set of NFC codepoints, but not for the other types. Also, the updated multipliers now all compile to shift-and-add on most platform/compiler combinations available on godbolt.org (earlier experiments found in [1]). The existing keyword lists are fine with the new set, and don't seem to be very picky in general. As a test, it also successfully finds a function for the OS "words" file, the "D" sets of codepoints, and for sets of the first n built-in OIDs, where n > 5.

0002 builds on top of the existing normprops infrastructure to use a hash function for NFC quick check. Below are typical numbers in a non-assert build:

select count(*) from (select md5(i::text) as t from generate_series(1,10) as i) s where t is nfc normalized;

HEAD   411ms 413ms 409ms
patch  296ms 297ms 299ms

The addition of "const" was to silence a compiler warning. Also, I changed the formatting of the output file slightly to match pgindent.

0003 uses hashing for NFKC and removes binary search. This is split out for readability. I gather NFKC is a less common use case, so this could technically be left out. Since this set is larger, the performance gains are a bit larger as well, at the cost of 19kB of binary space:

HEAD   439ms 440ms 442ms
patch  299ms 301ms 301ms

I'll add this to the July commitfest.

[1] https://www.postgresql.org/message-id/cacpnzcuvtilhxazxp9ucehguyhma59h6_pmp+_w-szxg0uy...@mail.gmail.com

--
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v1-0001-Tweak-the-set-of-candidate-multipliers-for-genera.patch
Description: Binary data

v1-0002-Use-perfect-hashing-for-NFC-Unicode-normalization.patch
Description: Binary data

v1-0003-Use-perfect-hashing-for-NFKC-Unicode-normalizatio.patch
Description: Binary data
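For readers who have not looked at the generated code, the lookup side of such a perfect hash is essentially "hash once, then verify": the function maps a codepoint to exactly one candidate slot, and a single comparison rejects codepoints that are not in the table, which is what replaces the binary search. The sketch below only illustrates that shape -- the type and function names are made up here and are not what the patch generates.

    #include <stdint.h>

    /* Illustrative names only -- not the structures generated by the patch. */
    typedef struct
    {
        uint32_t    codepoint;
        int8_t      quickcheck;     /* yes / no / maybe */
    } qc_entry;

    extern int  qc_hash(uint32_t cp);   /* generated perfect hash (not shown) */
    extern const qc_entry qc_table[];
    extern const int qc_table_size;

    /* The hash yields one candidate slot, so membership is a single compare. */
    static int8_t
    qc_lookup(uint32_t cp)
    {
        int         h = qc_hash(cp);

        if (h < 0 || h >= qc_table_size || qc_table[h].codepoint != cp)
            return 0;           /* not in the exception table, i.e. "yes" */
        return qc_table[h].quickcheck;
    }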
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, May 21, 2020 at 08:49:53AM +0200, Julien Rouhaud wrote:
> On Tue, May 19, 2020 at 4:29 AM Andy Fan wrote:
>> Thanks for the excellent extension. I want to add 5 more fields to satisfy the
>> following requirements.
>>
>> int subplan;   /* No. of subplan in this query */
>> int subquery;  /* No. of subquery */
>> int joincnt;   /* How many relations are joined */
>> bool hasagg;   /* if we have agg function in this query */
>> bool hasgroup; /* has group clause */
>
> Most of those fields can be computed using the raw sql statements. Why
> not add functions like query_has_agg(querytext) to get the
> information from pgss stored query text instead?

Yeah, I personally find concepts related only to the query string itself not something that needs to be tied to pg_stat_statements. While reading about those five new fields, I am also wondering how this stuff would work with CTEs. Particularly, should the hasagg or hasgroup flags be set only if the most outer query satisfies a condition? What if an inner query satisfies a condition but not an outer query? Should joincnt just be the sum of all the joins done in all queries, including subqueries?
--
Michael

signature.asc
Description: PGP signature
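For what it's worth, the CTE question above can be made concrete with a statement as simple as the following, where the aggregate and GROUP BY appear only in the inner query; whether such a statement should set hasagg/hasgroup is exactly the open point (the example only illustrates the question, not any proposed behaviour):

    -- The aggregate and GROUP BY live only in the CTE, not in the outer query.
    WITH per_kind AS (
        SELECT relkind, count(*) AS n
        FROM pg_class
        GROUP BY relkind
    )
    SELECT relkind, n
    FROM per_kind
    ORDER BY n DESC;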
Re: explicit_bzero for sslpassword
On Wed, May 20, 2020 at 10:06:55AM +0200, Peter Eisentraut wrote: > Looks correct to me. Thanks for confirming, Peter. Got this one applied. -- Michael signature.asc Description: PGP signature
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, May 21, 2020 at 09:17, Michael Paquier wrote:
> On Thu, May 21, 2020 at 08:49:53AM +0200, Julien Rouhaud wrote:
> > On Tue, May 19, 2020 at 4:29 AM Andy Fan wrote:
> >> Thanks for the excellent extension. I want to add 5 more fields to satisfy the
> >> following requirements.
> >>
> >> int subplan;   /* No. of subplan in this query */
> >> int subquery;  /* No. of subquery */
> >> int joincnt;   /* How many relations are joined */
> >> bool hasagg;   /* if we have agg function in this query */
> >> bool hasgroup; /* has group clause */
> >
> > Most of those fields can be computed using the raw sql statements. Why
> > not add functions like query_has_agg(querytext) to get the
> > information from pgss stored query text instead?
>
> Yeah, I personally find concepts related only to the query string
> itself not something that needs to be tied to pg_stat_statements.
> While reading about those five new fields, I am also wondering how
> this stuff would work with CTEs. Particularly, should the hasagg or
> hasgroup flags be set only if the most outer query satisfies a
> condition? What if an inner query satisfies a condition but not an
> outer query? Should joincnt just be the sum of all the joins done in
> all queries, including subqueries?

Indeed, CTEs will bring additional concerns about the fields' semantics. That's another good reason to go with external functions, so you can add extra parameters for that if needed.
Re: [Proposal] Page Compression for OLTP
Hello,

My 0.02€, some of which may just show some misunderstanding on my part:

- You have clearly given quite a bit of thought to the what and how… which makes your message an interesting read.

- Could this be proposed as some kind of extension, provided that enough hooks are available? ISTM that foreign tables and/or alternative storage engines (aka ACCESS METHOD) provide convenient APIs which could fit the need for these? Or are they not appropriate? You seem to suggest that they are not. If not, what could be done to improve the API to allow what you are seeking to do? Maybe you need a somewhat lower-level programmable API which does not exist already, or at least is not exported already, but could be specified and implemented with limited effort? Basically you would like to read/write pg pages to somewhere, and then there is the syncing issue to consider. Maybe such a "page storage" API could provide benefit for some specialized hardware, eg persistent memory stores, so there would be more reason to define it anyway? I think it might be valuable to give it some thought.

- Could you maybe elaborate on how your plan differs from [4] and [5]?

- Have you considered keeping page headers and compressing tuple data only?

- I'm not sure there is a point in going below the underlying file system blocksize, quite often 4 KiB? Or maybe yes? Or is there a benefit to aim at 1/4 even if most pages overflow?

- ISTM that your approach entails 3 "files". Could it be done with 2? I'd suggest that the possible overflow pointers (coa) could be part of the headers so that when reading the 3.1 page, the header would tell where to find the overflow 3.2, without requiring an additional independent structure with very small data in it, most of it zeros (a rough sketch of such a header follows below). Possibly this is not possible, because it would require some available space in standard headers when the page is not compressible, and there is not enough. Maybe creating a little room for that in existing headers (4 bytes could be enough?) would be a good compromise. Hmmm. Maybe the approach I suggest would only work for 1/2 compression, but not for other target ratios, but I think it could be made to work if the pointer can entail several blocks in the overflow table.

- If one page is split in 3 parts, could it create problems on syncing, if 1/3 or 2/3 pages get written? But maybe that is manageable with WAL, as it would note that the page was not synced and that is enough for replay.

- I'm unclear how you would manage the 2 representations of a page in memory. I'm afraid that an 8 KiB page compressed to 4 KiB would basically take 12 KiB, i.e. reduce the available memory for caching purposes. Hmmm. The current status is that a written page probably takes 16 KiB, once in shared buffers and once in the system caches, so it would be an improvement anyway.

- Maybe the compressed and overflow tables could become bloated somehow, which would require a vacuuming implementation and add to the complexity of the implementation?

- External tools should be available to allow page inspection, eg for debugging purposes.

--
Fabien.
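As a strawman for the "overflow pointer in the header" idea above, a per-page compression header along these lines needs only a few bytes, with the overflow reference itself being the 4 bytes mentioned. This is purely an illustrative sketch with made-up names, not something taken from the proposed patch:

    #include <stdint.h>

    #define INVALID_OVERFLOW_BLOCK  UINT32_MAX

    /*
     * Illustrative only: a small header stored with each compressed page so
     * that the overflow location lives with the page itself instead of in a
     * separate third file.
     */
    typedef struct CompressedPageHeader
    {
        uint16_t    compressed_len;     /* bytes of compressed payload here */
        uint8_t     algorithm;          /* 0 = none, 1 = pglz, 2 = lz4, ... */
        uint8_t     flags;
        uint32_t    overflow_block;     /* first overflow block, or INVALID_OVERFLOW_BLOCK */
    } CompressedPageHeader;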
Re: Is it useful to record whether plans are generic or custom?
At Thu, 21 May 2020 12:18:16 +0900, Fujii Masao wrote in
> On 2020/05/20 21:56, Atsushi Torikoshi wrote:
> > On Wed, May 20, 2020 at 1:32 PM Kyotaro Horiguchi wrote:
> > At Tue, 19 May 2020 22:56:17 +0900, Atsushi Torikoshi wrote in
> > > On Sat, May 16, 2020 at 6:01 PM legrand legrand wrote:
> > >
> > > BTW, I'd also appreciate other opinions about recording the number
> > > of generic and custom plans on pg_stat_statements.
> >
> > If you/we just want to know how a prepared statement is executed,
> > couldn't we show that information in pg_prepared_statements view?
> >
> > =# select * from pg_prepared_statements;
> > -[ RECORD 1 ]---+
> > name            | stmt1
> > statement       | prepare stmt1 as select * from t where b = $1;
> > prepare_time    | 2020-05-20 12:01:55.733469+09
> > parameter_types | {text}
> > from_sql        | t
> > exec_custom     | 5   <- existing num_custom_plans
> > exec_total      | 40  <- new member of CachedPlanSource
> >
> > Thanks, Horiguchi-san!
> > Adding counters to pg_prepared_statements seems useful when we want
> > to know the way prepared statements executed in the current session.
>
> I like the idea exposing more CachedPlanSource fields in
> pg_prepared_statements. I agree it's useful, e.g., for the debug
> purpose.
> This is why I implemented the similar feature in my extension.
> Please see [1] for details.

Thanks. I'm not sure plan_cache_mode should be a part of the view. Cost numbers would look better if it is cooked a bit. Is it worth being in core?

=# select * from pg_prepared_statements;
-[ RECORD 1 ]---+
name            | p1
statement       | prepare p1 as select a from t where a = $1;
prepare_time    | 2020-05-21 15:41:50.419578+09
parameter_types | {integer}
from_sql        | t
calls           | 7
custom_calls    | 5
plan_generation | 6
generic_cost    | 4.3105
custom_cost     | 9.31

Perhaps plan_generation is not needed there.

> > And I also feel adding counters to pg_stat_statements will be
> > convenient especially in production environments because it enables
> > us to get information about not only the current session but all
> > sessions of a PostgreSQL instance.
>
> +1

Agreed. It is global and persistent.

At Tue, 19 May 2020 22:56:17 +0900, Atsushi Torikoshi wrote in
> Instead, I'm now considering using a static hash for prepared queries
> (static HTAB *prepared_queries).

That might be complex and fragile considering nested query and SPI calls. I'm not sure, but could we use ActivePortal? ActivePortal->cplan is a CachedPlan, which can hold the generic/custom information.

> [1]
> https://github.com/MasaoFujii/pg_cheat_funcs#record-pg_cached_plan_sourcestmt-text

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 1c6a3dd41a59504244134ee44ddd4516191656da Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 May 2020 15:32:38 +0900
Subject: [PATCH] Expose counters of plancache to pg_prepared_statements

We didn't have an easy way to find how many times generic or custom plans of a prepared statement have been executed. This patch exposes such numbers and costs of both plans in pg_prepared_statements.
---
 doc/src/sgml/catalogs.sgml            | 45 +++
 src/backend/commands/prepare.c        | 30 --
 src/backend/utils/cache/plancache.c   | 13
 src/include/catalog/pg_proc.dat       |  6 ++--
 src/include/utils/plancache.h         |  5 +--
 src/test/regress/expected/prepare.out | 43 +
 src/test/regress/expected/rules.out   |  9 --
 src/test/regress/sql/prepare.sql      |  9 ++
 8 files changed, 144 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b1b077c97f..950ed30694 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -10826,6 +10826,51 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
  frontend/backend protocol
+
+ calls bigint
+
+ Number of times the prepared statement was executed
+
+ custom_calls bigint
+
+ Number of times generic plan is executed for the prepared statement
+
+ plan_geneation bigint
+
+ Number of times execution plan was generated for the prepared statement
+
+ generic_cost double precision
+
+ Estimated cost of generic plan. NULL if not yet planned.
+
+ custom_cost double precision
+
+ Estimated cost of custom plans.
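Assuming the columns the patch above proposes (calls, custom_calls, generic_cost, custom_cost -- none of which exist in released versions), a session could then check how its prepared statements are being planned with an ordinary query, for example:

    SELECT name,
           calls,
           custom_calls,
           calls - custom_calls AS generic_calls,
           generic_cost,
           custom_cost
    FROM pg_prepared_statements
    WHERE calls > 0
    ORDER BY custom_calls DESC;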
Re: [bug] Table not have typarray when created by single user mode
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

I verified and found no problems.
Re: Is it useful to record whether plans are generic or custom?
Hi Torikoshi-san! On 2020/05/21 17:10, Kyotaro Horiguchi wrote: At Thu, 21 May 2020 12:18:16 +0900, Fujii Masao wrote in On 2020/05/20 21:56, Atsushi Torikoshi wrote: On Wed, May 20, 2020 at 1:32 PM Kyotaro Horiguchi mailto:horikyota@gmail.com>> wrote: At Tue, 19 May 2020 22:56:17 +0900, Atsushi Torikoshi mailto:ato...@gmail.com>> wrote in > On Sat, May 16, 2020 at 6:01 PM legrand legrand > mailto:legrand_legr...@hotmail.com>> > wrote: > > BTW, I'd also appreciate other opinions about recording the number > of generic and custom plans on pg_stat_statemtents. If you/we just want to know how a prepared statement is executed, couldn't we show that information in pg_prepared_statements view? =# select * from pg_prepared_statements; -[ RECORD 1 ]---+ name| stmt1 statement | prepare stmt1 as select * from t where b = $1; prepare_time| 2020-05-20 12:01:55.733469+09 parameter_types | {text} from_sql| t exec_custom | 5<- existing num_custom_plans exec_total | 40 <- new member of CachedPlanSource Thanks, Horiguchi-san! Adding counters to pg_prepared_statements seems useful when we want to know the way prepared statements executed in the current session. I like the idea exposing more CachedPlanSource fields in pg_prepared_statements. I agree it's useful, e.g., for the debug purpose. This is why I implemented the similar feature in my extension. Please see [1] for details. Thanks. I'm not sure plan_cache_mode should be a part of the view. Cost numbers would look better if it is cooked a bit. Is it worth being in core? =# select * from pg_prepared_statements; -[ RECORD 1 ]---+ name| p1 statement | prepare p1 as select a from t where a = $1; prepare_time| 2020-05-21 15:41:50.419578+09 parameter_types | {integer} from_sql| t calls | 7 custom_calls| 5 plan_generation | 6 generic_cost| 4.3105 custom_cost | 9.31 Perhaps plan_generation is not needed there. I tried to creating PoC patch too, so I share it. Please find attached file. # Test case prepare count as select count(*) from pg_class where oid >$1; execute count(1); select * from pg_prepared_statements; -[ RECORD 1 ]---+-- name| count statement | prepare count as select count(*) from pg_class where oid >$1; prepare_time| 2020-05-21 17:41:16.134362+09 parameter_types | {oid} from_sql| t is_generic_plan | f <= False You can see the following result, when you execute it 6 times. -[ RECORD 1 ]---+-- name| count statement | prepare count as select count(*) from pg_class where oid >$1; prepare_time| 2020-05-21 17:41:16.134362+09 parameter_types | {oid} from_sql| t is_generic_plan | t <= True Thanks, Tatsuro Yamada diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c index 80d6df8ac1..63de4fdb78 100644 --- a/src/backend/commands/prepare.c +++ b/src/backend/commands/prepare.c @@ -723,7 +723,7 @@ pg_prepared_statement(PG_FUNCTION_ARGS) * build tupdesc for result tuples. This must match the definition of the * pg_prepared_statements view in system_views.sql */ - tupdesc = CreateTemplateTupleDesc(5); + tupdesc = CreateTemplateTupleDesc(6); TupleDescInitEntry(tupdesc, (AttrNumber) 1, "name", TEXTOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 2, "statement", @@ -734,6 +734,8 @@ pg_prepared_statement(PG_FUNCTION_ARGS) REGTYPEARRAYOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 5, "from_sql", BOOLOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "is_generic_plan", + BOOLOID, -1, 0); /* * We put all the tuples into a tuplestore in one scan of the hashtable. 
@@ -755,8 +757,8 @@ pg_prepared_statement(PG_FUNCTION_ARGS) hash_seq_init(&hash_seq, prepared_queries); while ((prep_stmt = hash_seq_search(&hash_seq)) != NULL) { - Datum values[5]; - boolnulls[5]; + Datum values[6]; + boolnulls[6]; MemSet(nulls, 0, sizeof(nulls)); @@ -766,6 +768,7 @@ pg_prepared_statement(PG_FUNCTION_ARGS) values[3] = build_regtype_array(prep_stmt->plansource->param_types, prep_stmt
Re: some grammar refactoring
On Tue, May 19, 2020 at 12:13 PM Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:
> Here is a series of patches to do some refactoring in the grammar around
> the commands COMMENT, DROP, SECURITY LABEL, and ALTER EXTENSION ...
> ADD/DROP. In the grammar, these commands (with some exceptions)
> basically just take a reference to an object and later look it up in C
> code. Some of that was already generalized individually for each
> command (drop_type_any_name, drop_type_name, etc.). This patch combines
> it into common lists for all these commands.
>
> Advantages:
>
> - Avoids having to list each object type at least four times.
>
> - Object types not supported by security labels or extensions are now
> explicitly listed and give a proper error message. Previously, this was
> just encoded in the grammar itself and specifying a non-supported object
> type would just give a parse error.
>
> - Reduces lines of code in gram.y.
>
> - Removes some old cruft.

I liked the idea. I had a quick glance through the patches and also did a quick review and testing. I haven't found any issues with the patches.

> --
> Peter Eisentraut              http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Rushabh Lathia
Re: pg13: xlogreader API adjust
At Fri, 15 May 2020 19:24:28 -0400, Alvaro Herrera wrote in > On 2020-May-15, Michael Paquier wrote: > > > On Thu, May 14, 2020 at 02:12:25PM +0900, Kyotaro Horiguchi wrote: > > > Good catch! That's not only for CreateDecodingContet. That happens > > > everywhere in the query loop in PostgresMain() until logreader is > > > initialized. So that also happens, for example, by starting logical > > > replication using invalidated slot. Checking xlogreader != NULL in > > > WalSndErrorCleanup is sufficient. It doesn't make actual difference, > > > but the attached explicitly initialize the pointer with NULL. > > > > Alvaro, are you planning to look at that? Should we have an open item > > for this matter? > > On it now. I'm trying to add a test for this (needs a small change to > PostgresNode->psql), but I'm probably doing something stupid in the Perl > side, because it doesn't detect things as well as I'd like. Still > trying, but I may be asked to evict the office soon ... FWIW, and I'm not sure which of the mail and the commit 1d3743023e was earlier, but I confirmed that the committed test in 006_logical_decoding.pl causes a crash, and the crash is fixed by the change of code. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: proposal: schema variables
Hi just rebase without any other changes Regards Pavel
Re: proposal: schema variables
On Thu, May 21, 2020 at 3:41 PM Pavel Stehule wrote: > > Hi > > just rebase without any other changes > You seem to forget attaching the rebased patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PostgreSQL 13 Beta 1 Release Announcement Draft
Hi John, On 5/21/20 12:12 AM, John Naylor wrote: > Hi Jon, > > I noticed a couple minor inconsistencies: > > ".datetime" -> elsewhere functions are formatted as `.datetime()` > > libpq -> `libpq` > > The link to the release notes on its own line is the same as the > inline link, if that makes sense. In other places with links on their > own line, the full URL is in the link text. > > Also, for "indexes that contain many repeat values", "repeated" might > sound better here. It's one of those things that jumped out at me at > first reading, but when trying both in my head, it seems ok. > > Regarding "streaming `pg_basebackup`s", I'm used to the general term > "base backups" in this usage, which seems a distinct concept from the > name of the invoked command. Thanks for the suggestions. I ended up incorporating all of them. Stay tuned for the release... Jonathan signature.asc Description: OpenPGP digital signature
Behaviour of failed Primary
Hi Forum,

If I have a cluster with synchronous replication enabled across three nodes, for example:

[primary] [hot standby 1] [hot standby 2]

And if, for some unforeseen reason, the primary fails, failover will kick in and hot standby 1 will become the new primary, so the cluster setup will look like this:

[new primary (hot standby 1)] [hot standby 2]

My question is: what will happen if the original primary that failed comes back? Will it become part of this highly available replica cluster automatically, or will it be stale and disconnected from the cluster? How can we automatically make the failed primary part of the cluster again in a hot standby role? It would be of great help if you could direct me to any reference material. Thank you upfront.

Regards,
Santhosh
Re: proposal: schema variables
čt 21. 5. 2020 v 13:34 odesílatel Amit Kapila napsal: > On Thu, May 21, 2020 at 3:41 PM Pavel Stehule > wrote: > > > > Hi > > > > just rebase without any other changes > > > > You seem to forget attaching the rebased patch. > I am sorry attached. Pavel > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > schema-variables-20200521.patch.gz Description: application/gzip
Re: factorial function/phase out postfix operators?
On Wed, May 20, 2020 at 2:24 PM Tom Lane wrote: > Right; I'd done the same arithmetic. Since we currently have a total > of 450 keywords of all flavors, that means we can make either 64% > of them or 74.6% of them be safe to use as bare column labels. While > that's surely better than today, it doesn't seem like it's going to > make for any sort of sea change in the extent of the problem. So I was > feeling a bit discouraged by these results. I don't think you should feel discouraged by these results. They assume that people are just as likely to have a problem with a reserved keyword as an unreserved keyword, and I don't think that's actually true. The 25.4% of keywords that aren't handled this way include, to take a particularly egregious example, "AS" itself. And I don't think many people are going to be sad if "select 1 as;" fails to treat "as" as a column label. Also, even if we only made 74.6% of these safe to use as bare column labels, or even 64%, I think that's actually pretty significant. If I could reduce my mortgage payment by 64%, I would be pretty happy. For many people, that would be a sufficiently large economic impact that it actually would be a sea change in terms of their quality of life. I don't see a reason to suppose that's not also true here.[1] I do like the idea of considering "can be a bare column label" as an independent dimension from the existing keyword classification. Presumably we would then have, in addition to the four existing keyword productions, but then also a separate bare_column_label_keyword: production that would include many of the same keywords. One nice thing about that approach is that we would then have a clear list of exactly which keywords can't be given that treatment, and if somebody wanted to go investigate possible improvements for any of those, they could do so. I think we'd want a cross-check: check_keywords.pl should contain the list of keywords that are expected to be excluded from this new production, so that any time someone adds a new keyword, they've either got to add it to the new production or add it to the exception list. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] On the other hand, if I had 64% fewer ants in my picnic basket, I would probably still be unhappy with the number of ants in my picnic basket, so it all depends on context and perspective.
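As a purely hypothetical illustration of the user-visible effect being discussed: with a separate bare_column_label_keyword production, a statement like the first one below could accept an unreserved keyword as an implicit column label, while keywords excluded from the production (AS being the obvious one) would still need quoting or an explicit AS. Which keywords end up on which side is exactly what the proposed check_keywords.pl cross-check would make explicit.

    -- With the proposed production, something like this could be accepted,
    -- labelling the column "day" without AS (not current behaviour):
    SELECT 1 day;

    -- Keywords left out of the production would still require AS or quoting:
    SELECT 1 AS "as";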
Re: WIP/PoC for parallel backup
On Thu, May 21, 2020 at 2:06 AM Rushabh Lathia wrote: > Yes. My colleague Suraj tried this and here are the pg_stat_activity output > files. > > Captured wait events after every 3 seconds during the backup for - > 1: parallel backup for 100GB data with 4 workers > (pg_stat_activity_normal_backup_100GB.txt) > 2: Normal backup (without parallel backup patch) for 100GB data > (pg_stat_activity_j4_100GB.txt) > > Here is the observation: > > The total number of events (pg_stat_activity) captured during above runs: > - 314 events for normal backups > - 316 events for parallel backups (-j 4) > > BaseBackupRead wait event numbers: (newly added) > 37 - in normal backups > 25 - in the parallel backup (-j 4) > > ClientWrite wait event numbers: > 175 - in normal backup > 1098 - in parallel backups > > ClientRead wait event numbers: > 0 - ClientRead in normal backup > 326 - ClientRead in parallel backups for diff processes. (all in idle state) So, basically, when we go from 1 process to 4, the additional processes spend all of their time waiting rather than doing any useful work, and that's why there is no performance benefit. Presumably, the reason they spend all their time waiting for ClientRead/ClientWrite is because the network between the two machines is saturated, so adding more processes that are trying to use it at maximum speed just leads to spending more time waiting for it to be available. Do we have the same results for the local backup case, where the patch helped? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
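For reference, the sampling described above can be done with a plain query against pg_stat_activity, run every few seconds while the backup is in progress; something along these lines (the BaseBackupRead wait event itself only exists with the patch applied):

    SELECT backend_type,
           state,
           wait_event_type,
           wait_event,
           count(*)
    FROM pg_stat_activity
    WHERE backend_type = 'walsender'
    GROUP BY 1, 2, 3, 4
    ORDER BY count(*) DESC;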
Re: Trouble with hashagg spill I/O pattern and costing
On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote: ... The problem is that the hashagg plan runs in ~1400 seconds, while the groupagg only takes ~360. And per explain analyze, the difference really is in the aggregation - if we subtract the seqscan, the sort+groupagg takes about 310s: -> GroupAggregate (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1) Group Key: lineitem_1.l_partkey -> Sort (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1) Sort Key: lineitem_1.l_partkey Sort Method: external merge Disk: 3569544kB -> Seq Scan on lineitem lineitem_1 (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1) while the hashagg takes ~1330s: -> HashAggregate (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=640 loops=1) Group Key: lineitem_1.l_partkey Planned Partitions: 128 Peak Memory Usage: 4249 kB Disk Usage: 26321840 kB HashAgg Batches: 16512 -> Seq Scan on lineitem lineitem_1 (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1) And that's while only writing 26GB, compared to 35GB in the sorted plan, and with cost being ~16M vs. ~43M (so roughly inverse). I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan writes out *way less* data to the temp file. The reason is revealed by explain verbose: -> GroupAggregate Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity)) Group Key: lineitem_1.l_partkey -> Sort Output: lineitem_1.l_partkey, lineitem_1.l_quantity Sort Key: lineitem_1.l_partkey -> Seq Scan on public.lineitem lineitem_1 Output: lineitem_1.l_partkey, lineitem_1.l_quantity -> HashAggregate Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity)) Group Key: lineitem_1.l_partkey -> Seq Scan on public.lineitem lineitem_1 Output: lineitem_1.l_orderkey, lineitem_1.l_partkey, lineitem_1.l_suppkey, lineitem_1.l_linenumber, lineitem_1.l_quantity, lineitem_1.l_extendedprice, lineitem_1.l_discount, lineitem_1.l_tax, lineitem_1.l_returnflag, lineitem_1.l_linestatus, lineitem_1.l_shipdate, lineitem_1.l_commitdate, lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct, lineitem_1.l_shipmode, lineitem_1.l_comment It seems that in the hashagg case we're not applying projection in the seqscan, forcing us to serialize way much data (the whole lineitem table, essentially). It's probably still worth tweaking the I/O pattern, I think. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Schedule of commit fests for PG14
Michael Paquier writes: > Normally $subject would have been discussed at the developer meeting > in Ottawa, but that's not going to happen per the current situation. > For the last couple of years, we have been using the same timeline for > for commit fests in a development cycle, so why not going with the > same flow this year? This would mean 5 CFs: > - 2020-07-01~2020-07-31 > - 2020-09-01~2020-09-30 > - 2020-11-01~2020-11-30 > - 2021-01-01~2021-01-31 > - 2021-03-01~2021-03-31 Yeah, nobody's expressed any great unhappiness with the schedule recently, so let's just do the same thing again this year. regards, tom lane
Re: Schedule of commit fests for PG14
On 5/21/20 2:35 AM, Michael Paquier wrote: Hi all, Normally $subject would have been discussed at the developer meeting in Ottawa, but that's not going to happen per the current situation. For the last couple of years, we have been using the same timeline for for commit fests in a development cycle, so why not going with the same flow this year? This would mean 5 CFs: - 2020-07-01~2020-07-31 - 2020-09-01~2020-09-30 - 2020-11-01~2020-11-30 - 2021-01-01~2021-01-31 - 2021-03-01~2021-03-31 Any thoughts or opinions? +1. This schedule seems to have worked fine the last two years. Regards, -- -David da...@pgmasters.net
Re: PostgreSQL 13 Beta 1 Release Announcement Draft
Congrats to all for the release of a new major version! Two questions: - Why is VACUUM together with FETCH FIRST WITH TIES, CREATE TABLE LIKE, ALTER VIEW, ALTER TABLE, etc in Utility Commands section? Shouldn't there be a separate section for SQL changes? (or keep one section but rename the Utility to include all?) > Add FOREIGN to ALTER statements, if appropriate (Luis Carril) > WHAT IS THIS ABOUT? - The "WHAT IS THIS ABOUT?" should be removed, in my opinion. Again, congrats for another release of the best database in the world. Pantelis Theodosiou On Thu, May 21, 2020 at 12:44 PM Jonathan S. Katz wrote: > Hi John, > > On 5/21/20 12:12 AM, John Naylor wrote: > > Hi Jon, > > > > I noticed a couple minor inconsistencies: > > > > ".datetime" -> elsewhere functions are formatted as `.datetime()` > > > > libpq -> `libpq` > > > > The link to the release notes on its own line is the same as the > > inline link, if that makes sense. In other places with links on their > > own line, the full URL is in the link text. > > > > Also, for "indexes that contain many repeat values", "repeated" might > > sound better here. It's one of those things that jumped out at me at > > first reading, but when trying both in my head, it seems ok. > > > > Regarding "streaming `pg_basebackup`s", I'm used to the general term > > "base backups" in this usage, which seems a distinct concept from the > > name of the invoked command. > > Thanks for the suggestions. I ended up incorporating all of them. > > Stay tuned for the release... > > Jonathan > >
Re: [PATCH] fix GIN index search sometimes losing results
Hi All! 1. Generally the difference of my patch in comparison to Tom's patch 0001 is that I tried to move previous logic of GIN's own TS_execute_ternary() to the general logic of TS_execute_recurse and in case we have index without positions to avoid diving into phrase operator replacing (only in this case) in by an AND operator. The reason for this I suppose is speed and I've done testing of some corner cases like phrase operator with big number of OR comparisons inside it. - BEFORE ANY PATCH: Bitmap Heap Scan on pglist (cost=1715.72..160233.31 rows=114545 width=1234) (actual time=652.294..2719.961 rows=4904 loops=1) Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Rows Removed by Index Recheck: 108191 Heap Blocks: exact=73789 -> Bitmap Index Scan on pglist_fts_idx (cost=0.00..1687.09 rows=114545 width=0) (actual time=636.883..636.883 rows=113095 loops=1) Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Planning Time: 3.016 ms Execution Time: *2721.002 ms* --- AFTER TOM's PATCH (0001) Bitmap Heap Scan on pglist (cost=1715.72..160233.31 rows=114545 width=1234) (actual time=916.640..2960.571 rows=4904 loops=1) Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Rows Removed by Index Recheck: 108191 Heap Blocks: exact=73789 -> Bitmap Index Scan on pglist_fts_idx (cost=0.00..1687.09 rows=114545 width=0) (actual time=900.472..900.472 rows=113095 loops=1) Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Planning Time: 2.688 ms Execution Time: *2961.704 ms* AFTER MY PATCH (gin-gist-weight-patch-v3) Bitmap Heap Scan on pglist (cost=1715.72..160233.31 rows=114545 width=1234) (actual time=616.982..2710.571 rows=4904 loops=1) Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | 
''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Rows Removed by Index Recheck: 108191 Heap Blocks: exact=73789 -> Bitmap Index Scan on pglist_fts_idx (cost=0.00..1687.09 rows=114545 width=0) (actual time=601.586..601.586 rows=113095 loops=1) Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Planning Time: 3.115 ms Execution Time: *2711.533 ms* I've done the test several times and seems that difference is real effect, though not very big (around 7%). So maybe there is some reason to save PHRASE_AS_AND behavior for GIN-GIST indexes despite migration from GIN's own TS_execute_ternary() to general TS_execute_recurse. 2. As for shifting from bool to ternary callback I am not quite sure whether it can be useful to save bool callbacks via bool-ternary wrapper. We can include this for compatibility with old callers and can drop. Any id
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 03:41:22PM +0200, Tomas Vondra wrote: On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote: ... The problem is that the hashagg plan runs in ~1400 seconds, while the groupagg only takes ~360. And per explain analyze, the difference really is in the aggregation - if we subtract the seqscan, the sort+groupagg takes about 310s: -> GroupAggregate (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1) Group Key: lineitem_1.l_partkey -> Sort (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1) Sort Key: lineitem_1.l_partkey Sort Method: external merge Disk: 3569544kB -> Seq Scan on lineitem lineitem_1 (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1) while the hashagg takes ~1330s: -> HashAggregate (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=640 loops=1) Group Key: lineitem_1.l_partkey Planned Partitions: 128 Peak Memory Usage: 4249 kB Disk Usage: 26321840 kB HashAgg Batches: 16512 -> Seq Scan on lineitem lineitem_1 (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1) And that's while only writing 26GB, compared to 35GB in the sorted plan, and with cost being ~16M vs. ~43M (so roughly inverse). I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan writes out *way less* data to the temp file. The reason is revealed by explain verbose: -> GroupAggregate Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity)) Group Key: lineitem_1.l_partkey -> Sort Output: lineitem_1.l_partkey, lineitem_1.l_quantity Sort Key: lineitem_1.l_partkey -> Seq Scan on public.lineitem lineitem_1 Output: lineitem_1.l_partkey, lineitem_1.l_quantity -> HashAggregate Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity)) Group Key: lineitem_1.l_partkey -> Seq Scan on public.lineitem lineitem_1 Output: lineitem_1.l_orderkey, lineitem_1.l_partkey, lineitem_1.l_suppkey, lineitem_1.l_linenumber, lineitem_1.l_quantity, lineitem_1.l_extendedprice, lineitem_1.l_discount, lineitem_1.l_tax, lineitem_1.l_returnflag, lineitem_1.l_linestatus, lineitem_1.l_shipdate, lineitem_1.l_commitdate, lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct, lineitem_1.l_shipmode, lineitem_1.l_comment It seems that in the hashagg case we're not applying projection in the seqscan, forcing us to serialize way much data (the whole lineitem table, essentially). It's probably still worth tweaking the I/O pattern, I think. OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST to CP_SMALL_TLIST) addresses this for me. I've only tried it on the patched version that pre-allocates 128 blocks, and the results seem pretty nice: sort hash hash+tlist -- 4MB 331 478188 128MB 222 434210 which I guess is what we wanted ... I'll give it a try on the other machine (temp on SATA), but I don't see why would it not behave similarly nicely. 
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9941dfe65e..08d43c270e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2118,7 +2118,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 	 * Agg can project, so no need to be terribly picky about child tlist, but
 	 * we do need grouping columns to be available
 	 */
-	subplan = create_plan_recurse(root, best_path->subpath, CP_LABEL_TLIST);
+	subplan = create_plan_recurse(root, best_path->subpath, CP_SMALL_TLIST);
 	tlist = build_path_tlist(root, &best_path->path);
Re: Behaviour of failed Primary
On Thu, May 21, 2020 at 5:38 PM Santhosh Kumar wrote: > > Hi Forum, > If I have a cluster with Synchronous replication enabled with three nodes, > for eg: > > [primary] [hot stand by 1] [host stand by 2] > > And for some unforeseen reasons, if primary fails, the failover will kick in > and hot stand by 1 will become new primary and cluster setup will look like > this > > [new primary (hot stand by1)] [host stand by 2] > > My question here is, what will happen if the original primary which has > failed comes back. Will it become part of this high available replica cluster > automatically or it will be stale and disconnected from the cluster? > It won't become standby automatically as it would have diverged from the new master. > How can we automatically make the failed primary to be part of the cluster > with hot standby role? It would be of great help, if you can direct me to any > references details. Thank you, upfront. > I think pg_rewind can help in such situations. See the docs of pg_rewind [1]. [1] - https://www.postgresql.org/docs/devel/app-pgrewind.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
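In practice, resynchronizing the old primary usually means shutting it down cleanly, running pg_rewind against the new primary, and then starting it up again with a standby configuration (primary_conninfo pointing at the new primary). Note that pg_rewind needs wal_log_hints = on or data checksums enabled on the old primary. A rough sketch of the rewind step, with placeholder paths and connection details:

    pg_rewind \
        --target-pgdata=/path/to/old_primary/data \
        --source-server='host=new_primary port=5432 user=postgres dbname=postgres'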
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 02:12:55AM +0200, Tomas Vondra wrote:
> ...
> I agree that's pretty nice. I wonder how far would we need to go before
> reaching a plateau. I'll try this on the other machine with temporary
> tablespace on SATA, but that'll take longer.

OK, I've managed to get some numbers from the other machine, with a 75GB data set and temp tablespace on SATA RAID. I haven't collected I/O data using iosnoop this time, because we already know how that changes from the other machine. I've also only done this with 128MB work_mem, because of how long a single run takes, and with 128 blocks pre-allocation. "patched+tlist" means both pre-allocation and the tlist tweak I've posted to this thread a couple minutes ago (times in seconds):

           master    patched    patched+tlist
  -------------------------------------------
  sort        485        472              462
  hash      24686       3060              559

So the pre-allocation makes it 10x faster, and the tlist tweak makes it 5x faster. Not bad, I guess.

Note: I've slightly tweaked read-ahead on the RAID device(s) on those patched runs, but the effect was pretty negligible (compared to other patched runs with the old read-ahead setting).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PostgreSQL 13 Beta 1 Release Announcement Draft
On Thu, May 21, 2020 at 3:20 PM Pantelis Theodosiou wrote: > Congrats to all for the release of a new major version! > > Two questions: > - Why is VACUUM together with FETCH FIRST WITH TIES, CREATE TABLE LIKE, > ALTER VIEW, ALTER TABLE, etc in Utility Commands section? > Shouldn't there be a separate section for SQL changes? (or keep one > section but rename the Utility to include all?) > > > Add FOREIGN to ALTER statements, if appropriate (Luis Carril) > > > WHAT IS THIS ABOUT? > - The "WHAT IS THIS ABOUT?" should be removed, in my opinion. > > Again, congrats for another release of the best database in the world. > > Pantelis Theodosiou > > On Thu, May 21, 2020 at 12:44 PM Jonathan S. Katz > wrote: > >> >> Thanks for the suggestions. I ended up incorporating all of them. >> >> Stay tuned for the release... >> >> Jonathan >> >> Apologies, I realized a minute too late that my comments are about the Release Notes and not the Announcement. However, since the link to Notes makes them no visible to more eyes, they could be checked again. Pantelis Theodosiou
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 10:45 AM Tomas Vondra wrote: > So the pre-allocation makes it 10x faster, and the tlist tweak makes it > 5x faster. Not bad, I guess. That is pretty great stuff, Tomas. FWIW, I agree that CP_SMALL_TLIST seems like the right thing here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_bsd_indent and -Wimplicit-fallthrough
On 18/05/2020 11.22, Julien Rouhaud wrote: On Sun, May 17, 2020 at 2:32 AM Michael Paquier wrote: On Sat, May 16, 2020 at 11:56:28AM -0400, Tom Lane wrote: In the meantime, I went ahead and pushed this to our pg_bsd_indent repo. Thanks, Tom. +1, thanks a lot! Committed upstream, thank you.
Re: POC: rational number type (fractions)
On Mon, May 18, 2020 at 6:15 PM Tom Lane wrote: > There surely are use-cases for true rational arithmetic, but I'm > dubious that it belongs in core Postgres. I don't think that enough > of our users would want it to justify expending core-project maintenance > effort on it. So I'd be happier to see this as an out-of-core extension. As is often the case, I'm a little more positive about including this than Tom, but as is also often the case, I'm somewhat cautious, too. On the one hand, I think it would be cool to have and people would like it. But, On the other hand, I also think we'd only want it if we're convinced that it's a really good implementation and that there's not a competing design which is better, or even equally good. Those things don't seem too clear at this point, so I hope Jeff and Joe keep chatting about it ... and maybe some other people who are knowledgeable about this will chime in, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, 2020-05-21 at 16:30 +0200, Tomas Vondra wrote: > OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST > to > CP_SMALL_TLIST) addresses this for me. Great! There were a couple plan changes where it introduced a Subquery Scan. I'm not sure that I understand why it's doing that, can you verify that it is a reasonable thing to do? Aside from that, feel free to commit it. Regards, Jeff Davis
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 11:19:01AM -0700, Jeff Davis wrote: On Thu, 2020-05-21 at 16:30 +0200, Tomas Vondra wrote: OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST to CP_SMALL_TLIST) addresses this for me. Great! There were a couple plan changes where it introduced a Subquery Scan. I'm not sure that I understand why it's doing that, can you verify that it is a reasonable thing to do? Aside from that, feel free to commit it. It's doing that because we're doing projection everywhere, even in cases when it may not be necessary - but I think that's actually OK. At first I thought we might only do it conditionally when we expect to spill to disk, but that'd not work for cases when we only realize we need to spill to disk during execution. So I think the plan changes are correct and expected. I think we should do the pre-allocation patch too. I haven't tried yet but I believe the tlist fix alone won't do nearly as good. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 08:34:05PM +0200, Tomas Vondra wrote:
> On Thu, May 21, 2020 at 11:19:01AM -0700, Jeff Davis wrote:
> ...
> I think we should do the pre-allocation patch too. I haven't tried yet
> but I believe the tlist fix alone won't do nearly as good.

I've done some measurements on the smaller (SSD) machine, and the comparison looks like this (times in seconds):

           sort    hash    hash+prealloc+tlist    hash+tlist
  4MB       331     478                    188           330
  128MB     222     434                    210           350

The last column is master with the tlist tweak alone - it's better than hashagg on master alone, but it's not nearly as good as with both tlist and prealloc patches.

I can't test this on the larger box with SATA temporary tablespace at the moment (other tests are running), but I believe the difference will be even more pronounced there.

I don't think we're under a lot of pressure - beta1 is out anyway, so we have time to do proper testing first.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, 2020-05-21 at 20:54 +0200, Tomas Vondra wrote:
> The last column is master with the tlist tweak alone - it's better
> than hashagg on master alone, but it's not nearly as good as with
> both tlist and prealloc patches.

Right, I certainly think we should do the prealloc change as well. I'm tweaking the patch to be a bit more flexible. I'm thinking we should start the preallocation list size at ~8 and then double it up to ~128 (depending on your results). That would reduce the waste in case we have a large number of small partitions.

Regards,
Jeff Davis
Re: Trouble with hashagg spill I/O pattern and costing
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
> On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
> > And is there a way to pre-allocate larger chunks? Presumably we could
> > assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x 8kB)
> > instead of just a single block. I haven't seen anything like that in
> > tape.c, though ...
>
> It turned out to be simple (at least a POC) so I threw together a patch.
> I just added a 32-element array of block numbers to each tape. When we
> need a new block, we retrieve a block number from that array; or if it's
> empty, we fill it by calling ltsGetFreeBlock() 32 times.

I think the PoC patch goes in the right direction. I have two ideas how to improve it a bit:

1) Instead of assigning the pages one by one, we can easily extend the API to allow getting a range of blocks, so that we don't need to call ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange with the requested number of blocks. And we could keep just a min/max of free blocks, not an array with a fixed number of elements.

2) We could make it self-tuning, by increasing the number of blocks we pre-allocate. So every time we exhaust the range, we double the number of blocks (with a reasonable maximum, like 1024 or so). Or we might just increment it by 32, or something.

IIUC the danger of pre-allocating blocks is that we might not fill them, resulting in a temp file much larger than necessary. It might be harmless on some (most?) current filesystems that don't actually allocate space for blocks that are never written, but it also confuses our accounting of temporary file sizes. So we should try to limit that, and growing the number of pre-allocated blocks over time seems reasonable.

Both (1) and (2) seem fairly simple, not much more complex than the current PoC patch.

I also wonder if we could collect / report useful statistics about I/O on the temporary file, not just the size. I mean, how many pages we've written/read, how sequential it was, etc. But some of that is probably only visible at the OS level (e.g. we have no insight into how the kernel combines writes in the page cache, etc.). This is clearly a matter for v14, though.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 12:04:19PM -0700, Jeff Davis wrote: On Thu, 2020-05-21 at 20:54 +0200, Tomas Vondra wrote: The last column is master with the tlist tweak alone - it's better than hashagg on master alone, but it's not nearly as good as with both tlist and prealloc patches. Right, I certainly think we should do the prealloc change, as well. I'm tweaking the patch to be a bit more flexible. I'm thinking we should start the preallocation list size ~8 and then double it up to ~128 (depending on your results). That would reduce the waste in case we have a large number of small partitions. You're reading my mind ;-) I don't think 128 is necessarily the maximum we should use - it's just that I haven't tested higher values. I wouldn't be surprised if higher values made it a bit faster. But we can test and tune that, I agree with growing the number of pre-allocted blocks over time. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote: > 1) Instead of assigning the pages one by one, we can easily extend > the > API to allow getting a range of blocks, so that we don't need to call > ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange > with the requested number of blocks. ltsGetFreeBlock() just draws one element from a minheap. Is there some more efficient way to get many elements from a minheap at once? > And we could keep just a min/max > of free blocks, not an array with fixed number of elements. I don't quite know what you mean. Can you elaborate? Regards, Jeff Davis
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
> On Tue, Apr 14, 2020 at 09:09:31PM +1200, David Rowley wrote:
>
> The infrastructure (knowing the unique properties of a RelOptInfo), as
> provided by the patch Andy has been working on, which is based on my
> rough prototype version, I believe should be used for the skip scans
> patch as well.

Hi,

Following our agreement about making the skip scan patch use the UniqueKeys implementation from this thread, I've rebased index skip scan on the first two patches from the v8 series [1] (if I understand correctly, those two introduce the concept, and the others are just using it). I would like to clarify a couple of things:

* It seems populate_baserel_uniquekeys, which actually sets uniquekeys, is called after create_index_paths, where index skip scan already needs to use them. Is it possible to call it earlier?

* Do I understand correctly that a UniqueKey would be created, in the simplest case, only when an index is unique? This makes it different from what was implemented for index skip scan, since there UniqueKeys also represent the potential to use a non-unique index to facilitate searching for unique values via skipping.

[1]: https://www.postgresql.org/message-id/CAKU4AWpOM3_J-B%3DwQtCeU1TGr89MhpJBBkv2he1tAeQz6i4XNw%40mail.gmail.com
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, 22 May 2020 at 07:49, Dmitry Dolgov <9erthali...@gmail.com> wrote: > * It seems populate_baserel_uniquekeys, which actually sets uniquekeys, > is called after create_index_paths, where index skip scan already > needs to use them. Is it possible to call it earlier? Seems reasonable. I originally put it after check_index_predicates(). We need to wait until at least then so we can properly build UniqueKeys for partial indexes. > * Do I understand correctly, that a UniqueKey would be created in a > simplest case only when an index is unique? This makes it different > from what was implemented for index skip scan, since there UniqueKeys > also represents potential to use non-unique index to facilitate search > for unique values via skipping. The way I picture the two working together is that the core UniqueKey patch adds UniqueKeys to RelOptInfos to allow us to have knowledge about what they're unique on based on the base relation's unique indexes. For Skipscans, that patch must expand on UniqueKeys to teach Paths about them. I imagine we'll set some required UniqueKeys during standard_qp_callback() and then we'll try to create some Skip Scan paths (which are marked with UniqueKeys) if the base relation does not already have UniqueKeys that satisfy the required UniqueKeys that were set during standard_qp_callback(). In the future, there may be other reasons to create Skip Scan paths for certain rels, e.g if they're on the inner side of a SEMI/ANTI join, it might be useful to try those when planning joins. David
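For context, the sort of query both patches are ultimately aimed at looks like the one below: with a multi-column index, the distinct values of the leading column can be found by skipping from one value to the next in the index instead of reading every entry, and UniqueKeys are what let the planner reason about when such a path (or dropping a DISTINCT entirely) is legal. The table here is just an illustration:

    CREATE TABLE measurements (sensor_id int, recorded_at timestamptz, reading numeric);
    CREATE INDEX ON measurements (sensor_id, recorded_at);

    -- A skip scan path would target queries like this, jumping from one
    -- sensor_id to the next in the index rather than scanning all entries:
    SELECT DISTINCT sensor_id FROM measurements;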
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Thu, 14 May 2020 at 14:39, Andy Fan wrote: > > On Thu, May 14, 2020 at 6:20 AM David Rowley wrote: >> Having the "onerow" flag was not how I intended it to work. >> > Thanks for the detailed explanation. So I think we do need to handle onerow > specially, (It means more things than adding each column as a UniqueKey). > but we don't need the onerow flag since we can tell it by ukey->exprs == NIL. > > During the developer of this feature, I added some Asserts as double checking > for onerow and exprs. it helps me to find some special cases. like > SELECT FROM multirows union SELECT FROM multirows; where targetlist is NIL. > (I find the above case returns onerow as well just now). so onerow flag > allows us > check this special things with more attention. Even this is not the original > intention > but looks it is the one purpose now. But surely that special case should just go in populate_unionrel_uniquekeys(). If the targetlist is empty, then add a UniqueKey with an empty set of exprs. > However I am feeling that removing onerow flag doesn't require much of code > changes. Almost all the special cases which are needed before are still needed > after that and all the functions based on that like relation_is_onerow > /add_uniquekey_onerow is still valid, we just need change the implementation. > so do you think we need to remove onerow flag totally? Well, at the moment I'm not quite understanding why it's needed. If it's not needed then we should remove it. If it turns out there is some valid reason for it, then we should keep it. David
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 12:40:23PM -0700, Jeff Davis wrote: On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote: 1) Instead of assigning the pages one by one, we can easily extend the API to allow getting a range of blocks, so that we don't need to call ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange with the requested number of blocks. ltsGetFreeBlock() just draws one element from a minheap. Is there some more efficient way to get many elements from a minheap at once? And we could keep just a min/max of free blocks, not an array with fixed number of elements. I don't quite know what you mean. Can you elaborate?

Ah, I forgot there's an internal minheap thing - I thought we were just incrementing some internal counter or something like that, but with the minheap we can't just get a range of blocks. So just disregard that, you're right we need the array.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote: > 2) We could make it self-tuning, by increasing the number of blocks > we pre-allocate. So every time we exhaust the range, we double the > number of blocks (with a reasonable maximum, like 1024 or so). Or we > might just increment it by 32, or something. Attached a new version that uses the doubling behavior, and cleans it up a bit. It also returns the unused prealloc blocks back to lts- >freeBlocks when the tape is rewound for reading. > IIUC the danger of pre-allocating blocks is that we might not fill > them, > resulting in temp file much larger than necessary. It might be > harmless > on some (most?) current filesystems that don't actually allocate > space > for blocks that are never written, but it also confuses our > accounting > of temporary file sizes. So we should try to limit that, and growing > the > number of pre-allocated blocks over time seems reasonable. There's another danger here: it doesn't matter how well the filesystem deals with sparse writes, because ltsWriteBlock fills in the holes with zeros anyway. That's potentially a significant amount of wasted IO effort if we aren't careful. Regards, Jeff Davis diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c index bc8d56807e2..84f9d6d2358 100644 --- a/src/backend/utils/sort/logtape.c +++ b/src/backend/utils/sort/logtape.c @@ -110,6 +110,8 @@ typedef struct TapeBlockTrailer #define TapeBlockSetNBytes(buf, nbytes) \ (TapeBlockGetTrailer(buf)->next = -(nbytes)) +#define TAPE_WRITE_PREALLOC_MIN 8 +#define TAPE_WRITE_PREALLOC_MAX 128 /* * This data structure represents a single "logical tape" within the set @@ -151,6 +153,21 @@ typedef struct LogicalTape int max_size; /* highest useful, safe buffer_size */ int pos; /* next read/write position in buffer */ int nbytes; /* total # of valid bytes in buffer */ + + /* + * When multiple tapes are being written to concurrently (as in HashAgg), + * avoid excessive fragmentation by preallocating block numbers to + * individual tapes. The preallocated block numbers are held in an array + * sorted in descending order; blocks are consumed from the end of the + * array (lowest block numbers first). + * + * No filesystem operations are performed for preallocation; only the + * block numbers are reserved. This may lead to sparse writes, which will + * cause ltsWriteBlock() to fill in holes with zeros. + */ + long *prealloc; + int nprealloc; /* number of elements in list */ + int prealloc_size; /* number of elements list can hold */ } LogicalTape; /* @@ -198,6 +215,7 @@ struct LogicalTapeSet static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer); static void ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer); static long ltsGetFreeBlock(LogicalTapeSet *lts); +static long ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt); static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum); static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared, SharedFileSet *fileset); @@ -397,6 +415,51 @@ ltsGetFreeBlock(LogicalTapeSet *lts) return blocknum; } +/* + * Return the lowest free block number from the tape's preallocation list. + * Refill the preallocation list if necessary. 
+ */ +static long +ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt) +{ + /* sorted in descending order, so return the last element */ + if (lt->nprealloc > 0) + return lt->prealloc[--lt->nprealloc]; + + if (lt->prealloc == NULL) + { + lt->prealloc_size = TAPE_WRITE_PREALLOC_MIN; + lt->prealloc = (long *) palloc(sizeof(long) * lt->prealloc_size); + } + else + { + /* when the preallocation list runs out, double the size */ + if (lt->prealloc_size * 2 <= TAPE_WRITE_PREALLOC_MAX) + lt->prealloc_size *= 2; + lt->prealloc = (long *) repalloc(lt->prealloc, + sizeof(long) * lt->prealloc_size); + } + + /* refill preallocation list */ + lt->nprealloc = lt->prealloc_size; + for (int i = lt->nprealloc; i > 0; i--) + lt->prealloc[i - 1] = ltsGetFreeBlock(lts); + +#ifdef USE_ASSERT_CHECKING + { + /* verify that list is in reverse sorted order */ + long last_block_number = -1; + for (int i = lt->nprealloc; i > 0; i--) + { + Assert(lt->prealloc[i - 1] > last_block_number); + last_block_number = lt->prealloc[i - 1]; + } + } +#endif + + return lt->prealloc[--lt->nprealloc]; +} + /* * Return a block# to the freelist. */ @@ -557,6 +620,9 @@ ltsInitTape(LogicalTape *lt) lt->max_size = MaxAllocSize; lt->pos = 0; lt->nbytes = 0; + lt->prealloc = NULL; + lt->nprealloc = 0; + lt->prealloc_size = 0; } /* @@ -709,7 +775,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum, Assert(lt->firstBlockNumber == -1); Assert(lt->pos == 0); - lt->curBlockNumber = ltsGetFreeBlock(lts); + lt->curBlockNumber = ltsGetPreallocBlock(lts, lt); lt->firstBlockNumber = lt->curBlockNumber; TapeBlockGetTrail
Re: pgindent && weirdness
On 16/01/2020 03.59, Thomas Munro wrote: On Wed, Jan 15, 2020 at 11:30 AM Tom Lane wrote: Alvaro Herrera writes: I just ran pgindent over some patch, and noticed that this hunk ended up in my working tree: - if (IsA(leftop, Var) && IsA(rightop, Const)) + if (IsA(leftop, Var) &&IsA(rightop, Const)) Yeah, it's been doing that for decades. I think the triggering factor is the typedef name (Var, here) preceding the &&. It'd be nice to fix properly, but I've tended to take the path of least resistance by breaking such lines to avoid the ugliness: if (IsA(leftop, Var) && IsA(rightop, Const)) I am on vacation away from the Internet this week but somehow saw this on my phone and couldn't stop myself from peeking at pg_bsd_ident again. Yeah, "(Var)" (where Var is a known typename) causes it to think that any following operator must be unary. One way to fix that in the cases Alvaro is referring to is to tell override the setting so that && (and likewise ||) are never considered to be unary, though I haven't tested this much and there are surely other ways to achieve this: diff --git a/lexi.c b/lexi.c index d43723c..6de3227 100644 --- a/lexi.c +++ b/lexi.c @@ -655,6 +655,12 @@ stop_lit: unary_delim = state->last_u_d; break; } + + /* && and || are never unary */ + if ((token[0] == '&' && *buf_ptr == '&') || + (token[0] == '|' && *buf_ptr == '|')) + state->last_u_d = false; + while (*(e_token - 1) == *buf_ptr || *buf_ptr == '=') { /* * handle ||, &&, etc, and also things as in int *i The problem with that is that && sometimes *should* be formatted like a unary operator: when it's part of the nonstandard GCC computed goto syntax. These comments are made in the context of pushing this change or equivalent to FreeBSD repository. I think this is a better approach then the one committed to pg_bsd_indent. It's ubiquitous that the operators are binary, except - as you mentioned - in a nonstandard GCC syntax. The alternative given has more disadvantages, with potential impact on FreeBSD code formatting, which it should support as well as everything else -- to a reasonable extent. sys/kern/ is usually a decent litmus test, but I don't claim it should show anything interesting in this particular case. This change may seem hacky, but it would be far from the worst hack in this program's history or even in its present form. It's actually very much in indent's spirit, which is an attribute I neither support nor condemn. In any case, this change, or equivalent, should be committed to FreeBSD repository together with a test case or two.
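For reference, a minimal example of the nonstandard GCC computed-goto syntax mentioned above, where && genuinely acts as a unary operator (taking the address of a label), so a rule that unconditionally treats && as binary would misformat it:

/* GNU C only: &&label yields the address of a label */
void
computed_goto_example(int i)
{
	static void *dispatch[] = {&&handle_zero, &&handle_other};

	goto *dispatch[i != 0];

handle_zero:
	return;

handle_other:
	return;
}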
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 02:16:37PM -0700, Jeff Davis wrote: On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote: 2) We could make it self-tuning, by increasing the number of blocks we pre-allocate. So every time we exhaust the range, we double the number of blocks (with a reasonable maximum, like 1024 or so). Or we might just increment it by 32, or something. Attached a new version that uses the doubling behavior, and cleans it up a bit. It also returns the unused prealloc blocks back to lts- freeBlocks when the tape is rewound for reading. Ah, the returning is a nice idea, that should limit the overhead quite a bit, I think. IIUC the danger of pre-allocating blocks is that we might not fill them, resulting in temp file much larger than necessary. It might be harmless on some (most?) current filesystems that don't actually allocate space for blocks that are never written, but it also confuses our accounting of temporary file sizes. So we should try to limit that, and growing the number of pre-allocated blocks over time seems reasonable. There's another danger here: it doesn't matter how well the filesystem deals with sparse writes, because ltsWriteBlock fills in the holes with zeros anyway. That's potentially a significant amount of wasted IO effort if we aren't careful. True. I'll give it a try on both machines and report some numbers. Might take a couple of days. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
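To illustrate the "return unused blocks on rewind" part described above -- a hedged sketch only, the actual hunk in the attached patch may look different, and ltsReturnPreallocBlocks is an invented name -- the idea is just to hand any block numbers still sitting in the tape's preallocation list back to the set's freelist, since nothing has been written to them yet:

static void
ltsReturnPreallocBlocks(LogicalTapeSet *lts, LogicalTape *lt)
{
	/* give never-written, still-reserved block numbers back to the freelist */
	while (lt->nprealloc > 0)
		ltsReleaseBlock(lts, lt->prealloc[--lt->nprealloc]);

	if (lt->prealloc != NULL)
	{
		pfree(lt->prealloc);
		lt->prealloc = NULL;
		lt->prealloc_size = 0;
	}
}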
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 21 May 2020 at 17:06, David Rowley wrote: > For the patch. I know you just put it together quickly, but I don't > think you can do that ramp up the way you have. It looks like there's > a risk of torn reads and torn writes and I'm unsure how much that > could affect the test results here. Oops. On closer inspection, I see that memory is per worker, not global to the scan.
Re: Parallel Seq Scan vs kernel read ahead
On Fri, May 22, 2020 at 10:00 AM David Rowley wrote: > On Thu, 21 May 2020 at 17:06, David Rowley wrote: > > For the patch. I know you just put it together quickly, but I don't > > think you can do that ramp up the way you have. It looks like there's > > a risk of torn reads and torn writes and I'm unsure how much that > > could affect the test results here. > > Oops. On closer inspection, I see that memory is per worker, not > global to the scan.

Right, I think it's safe. I think you were probably right that ramp-up isn't actually useful though, it's only the end of the scan that requires special treatment so we don't get unfair allocation as the work runs out, due to the coarse grain. I suppose that even if you have a scheme that falls back to fine grained allocation for the final N pages, it's still possible that a highly distracted process (most likely the leader given its double duties) can finish up sitting on a large range of pages and eventually have to process them all at the end after the other workers have already knocked off and gone for a pint.
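A sketch of the sort of end-of-scan fallback being discussed -- entirely illustrative: the 64-block chunk, the 4x threshold and choose_chunk_size() are assumptions, not the test patch. Hand out coarse chunks for most of the scan, then drop to single blocks near the end so no worker is left holding a long tail:

static BlockNumber
choose_chunk_size(ParallelBlockTableScanDesc pbscan, BlockNumber allocated)
{
	BlockNumber remaining = pbscan->phs_nblocks - allocated;

	/* coarse-grained allocation while plenty of blocks remain */
	if (remaining > 4 * 64)
		return 64;

	/* fine-grained allocation for the final stretch of the relation */
	return 1;
}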
Re: pgindent && weirdness
Piotr Stefaniak writes: > On 16/01/2020 03.59, Thomas Munro wrote: >> One way to fix that in the cases Alvaro is referring to is to tell >> override the setting so that && (and likewise ||) are never considered >> to be unary, though I haven't tested this much and there are surely >> other ways to achieve this: > I think this is a better approach then the one committed to > pg_bsd_indent. It's ubiquitous that the operators are binary, except - > as you mentioned - in a nonstandard GCC syntax. The alternative given > has more disadvantages, with potential impact on FreeBSD code > formatting, which it should support as well as everything else -- to a > reasonable extent. sys/kern/ is usually a decent litmus test, but I > don't claim it should show anything interesting in this particular case. I think that the fix we chose is better, at least for our purposes. You can see its effects on our source tree at https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa27dd40d5c5f56a1ee837a75c97549e992e32a4 and while certainly most of the diffs are around && or || operators, there are a fair number that are not, such as - dummy_lex.input = unconstify(char *, str) +1; + dummy_lex.input = unconstify(char *, str) + 1; or more interestingly - strncmp(text, "pg_", 3) !=0) + strncmp(text, "pg_", 3) != 0) where the previous misformatting is because "text" is a known typedef name. So it appears to me that this change reduces the misformatting cost of typedef names that chance to match field or variable names, and that's actually quite a big issue for us. We have, ATM, 3574 known typedefs in typedefs.list, a fair number of which are not under our control (since they come from system headers on various platforms). So it's inevitable that there are going to be collisions. In short, I'm inclined to stick with the hack we've got. I'm sad that it will result in further divergence from FreeBSD indent; but it does useful stuff for us, and I don't want to give it up. (That said, I don't see any penalty to carrying both changes; so we'll probably also absorb the &&/|| change at some convenient time.) regards, tom lane
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, May 22, 2020 at 4:40 AM David Rowley wrote: > On Fri, 22 May 2020 at 07:49, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > * It seems populate_baserel_uniquekeys, which actually sets uniquekeys, > > is called after create_index_paths, where index skip scan already > > needs to use them. Is it possible to call it earlier? > > Seems reasonable. I originally put it after check_index_predicates(). > We need to wait until at least then so we can properly build > UniqueKeys for partial indexes. > > Looks a very valid reason, I will add this in the next version. > > * Do I understand correctly, that a UniqueKey would be created in a > > simplest case only when an index is unique? This makes it different > > from what was implemented for index skip scan, since there UniqueKeys > > also represents potential to use non-unique index to facilitate search > > for unique values via skipping. > > The way I picture the two working together is that the core UniqueKey > patch adds UniqueKeys to RelOptInfos to allow us to have knowledge > about what they're unique on based on the base relation's unique > indexes. > For Skipscans, that patch must expand on UniqueKeys to teach Paths > about them. I imagine we'll set some required UniqueKeys during > standard_qp_callback() and then we'll try to create some Skip Scan > paths (which are marked with UniqueKeys) if the base relation does not > already have UniqueKeys that satisfy the required UniqueKeys that were > set during standard_qp_callback(). In the future, there may be other > reasons to create Skip Scan paths for certain rels, e.g if they're on > the inner side of a SEMI/ANTI join, it might be useful to try those > when planning joins. > > Yes, In current implementation, we also add UniqueKey during create_xxx_paths, xxx may be grouping/union. after the index skipscan patch, we can do the similar things in create_indexskip_path. -- Best Regards Andy Fan
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
> if I understand correctly those two are introducing the concept, and others are just using it

You understand it correctly.

-- Best Regards Andy Fan
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, May 22, 2020 at 4:52 AM David Rowley wrote: > On Thu, 14 May 2020 at 14:39, Andy Fan wrote: > > > > On Thu, May 14, 2020 at 6:20 AM David Rowley > wrote: > >> Having the "onerow" flag was not how I intended it to work. > >> > > Thanks for the detailed explanation. So I think we do need to handle > onerow > > specially, (It means more things than adding each column as a UniqueKey). > > but we don't need the onerow flag since we can tell it by ukey->exprs == > NIL. > > > > During the developer of this feature, I added some Asserts as double > checking > > for onerow and exprs. it helps me to find some special cases. like > > SELECT FROM multirows union SELECT FROM multirows; where targetlist is > NIL. > > (I find the above case returns onerow as well just now). so onerow flag > allows us > > check this special things with more attention. Even this is not the > original intention > > but looks it is the one purpose now. > > But surely that special case should just go in > populate_unionrel_uniquekeys(). If the targetlist is empty, then add a > UniqueKey with an empty set of exprs. > > This is correct on this special case. > However I am feeling that removing onerow flag doesn't require much of > code > > changes. Almost all the special cases which are needed before are still > needed > > after that and all the functions based on that like relation_is_onerow > > /add_uniquekey_onerow is still valid, we just need change the > implementation. > > so do you think we need to remove onerow flag totally? > > Well, at the moment I'm not quite understanding why it's needed. If > it's not needed then we should remove it. If it turns out there is > some valid reason for it, then we should keep it. > Currently I uses it to detect more special case which we can't image at first, we can understand it as it used to debug/Assert purpose only. After the mainly code is reviewed, that can be removed (based on the change is tiny). -- Best Regards Andy Fan
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Thu, 21 May 2020 at 00:56, Simon Riggs wrote: > I thought the main reason to do this was the case when the nested loop > subplan was significantly underestimated and we realize during execution that > we should have built a hash table. So including this based on cost alone > seems to miss a trick.

Isn't that mostly because the planner tends to choose a non-parameterized nested loop when it thinks the outer side of the join has just 1 row? If so, I'd say that's a separate problem, as Result Cache only deals with parameterized nested loops.

Perhaps the problem you mention could be fixed by adding some "uncertainty degree" to the selectivity estimate function and having it return that along with the selectivity. We'd likely not want to choose an unparameterized nested loop when the uncertainty level is high. Multiplying the selectivity of different selectivity estimates could raise the uncertainty level by a magnitude.

For plans where the planner chooses to use a non-parameterized nested loop due to having just 1 row on the outer side of the loop, it's taking a huge risk. The cost of putting the 1 row on the inner side of a hash join would barely cost anything extra during execution. Hashing 1 row is pretty cheap and performing a lookup on that hashed row is not much more expensive than evaluating the qual of the nested loop. Really, it just requires the additional hash function calls. Having the uncertainty degree I mentioned above would allow us to only have the planner do that when the uncertainty degree indicates it's not worth the risk.

David
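Purely to illustrate the "uncertainty degree" idea above -- nothing like this exists in the planner today; the struct and function are invented for the sketch -- an estimate could carry a risk measure that grows as estimates are combined:

typedef struct SelectivityEstimate
{
	Selectivity sel;			/* estimated fraction of rows */
	double		uncertainty;	/* 0 = exact, larger = riskier */
} SelectivityEstimate;

static SelectivityEstimate
combine_estimates(SelectivityEstimate a, SelectivityEstimate b)
{
	SelectivityEstimate result;

	result.sel = a.sel * b.sel;
	/* multiplying independent guesses compounds the risk of being badly wrong */
	result.uncertainty = a.uncertainty + b.uncertainty + 1.0;
	return result;
}

Join planning could then refuse an unparameterized nested loop whenever the outer side's uncertainty exceeds some threshold, even if the row estimate is 1.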
Re: [Proposal] Page Compression for OLTP
At 2020-05-21 15:04:55, "Fabien COELHO" wrote: > >Hello, > >My 0.02, some of which may just show some misunderstanding on my part: > > - Could this be proposed as some kind of extension, provided that enough > hooks are available? ISTM that foreign tables and/or alternative > storage engine (aka ACCESS METHOD) provide convenient APIs which could > fit the need for these? Or are they not appropriate? You seem to > suggest that there are not. > > If not, what could be done to improve API to allow what you are seeking > to do? Maybe you need a somehow lower-level programmable API which does > not exist already, or at least is not exported already, but could be > specified and implemented with limited effort? Basically you would like > to read/write pg pages to somewhere, and then there is the syncing > issue to consider. Maybe such a "page storage" API could provide > benefit for some specialized hardware, eg persistent memory stores, > so there would be more reason to define it anyway? I think it might > be valuable to give it some thoughts. Thank you for giving so many comments. In my opinion, developing a foreign table or a new storage engine, in addition to compression, also needs to do a lot of extra things. A similar explanation was mentioned in Nikolay P's email. The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it. > - Could you maybe elaborate on how your plan differs from [4] and [5]? My solution is similar to CFS, and it is also embedded in the file access layer (fd.c, md.c) to realize the mapping from block number to the corresponding file and location where compressed data is stored. However, the most important difference is that I hope to avoid the need for GC through the design of the page layout. https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net >> The most difficult thing in CFS development is certainly >> defragmentation. In CFS it is done using background garbage collection, >> by one or one >> GC worker processes. The main challenges were to minimize its >> interaction with normal work of the system, make it fault tolerant and >> prevent unlimited growth of data segments. >> CFS is not introducing its own storage manager, it is mostly embedded in >> existed Postgres file access layer (fd.c, md.c). It allows to reused >> code responsible for mapping relations and file descriptors cache. As it >> was recently discussed in hackers, it may be good idea to separate the >> questions "how to map blocks to filenames and offsets" and "how to >> actually perform IO". In this it will be easier to implement compressed >> storage manager. > - Have you consider keeping page headers and compressing tuple data > only? In that case, we must add some additional information in the page header to identify whether this is a compressed page or an uncompressed page. When a compressed page becomes an uncompressed page, or vice versa, an uncompressed page becomes a compressed page, the original page header must be modified. This is unacceptable because it requires modifying the shared buffer and recalculating the checksum. However, it should be feasible to put this flag in the compressed address file. The problem with this is that even if a page only occupies the size of one compressed block, the address file needs to be read, that is, from 1 IO to 2 IO. Since the address file is very small, it is basically a memory access, this cost may not be as large as I had imagined. 
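As a hedged sketch of what such a per-page entry in the compressed address file might look like -- the struct name, field names and the four-chunk limit are illustrative assumptions, not the proposed on-disk format:

typedef struct CompressAddrEntry
{
	uint8		nchunks;		/* 0 means the page is stored uncompressed */
	uint32		chunkno[4];		/* chunk numbers inside the compressed data file */
} CompressAddrEntry;

Reading block N would then mean reading entry N from the address file (usually cached in memory) and then the listed chunks, which is the 1 IO vs. 2 IO trade-off mentioned above.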
> - I'm not sure there is a point in going below the underlying file > system blocksize, quite often 4 KiB? Or maybe yes? Or is there > a benefit to aim at 1/4 even if most pages overflow? My solution is mainly optimized for scenarios where the original page can be compressed to only require one compressed block of storage. The scene where the original page is stored in multiple compressed blocks is suitable for scenarios that are not particularly sensitive to performance, but are more concerned about the compression rate, such as cold data. In addition, users can also choose to compile PostgreSQL with 16KB or 32KB BLOCKSZ. > - ISTM that your approach entails 3 "files". Could it be done with 2? > I'd suggest that the possible overflow pointers (coa) could be part of > the headers so that when reading the 3.1 page, then the header would > tell where to find the overflow 3.2, without requiring an additional > independent structure with very small data in it, most of it zeros. > Possibly this is not possible, because it would require some available > space in standard headers when the is page is not compressible, and > there is not enough. Maybe creating a little room for that in > existing headers (4 bytes could be enough?) would be a good compromise. > Hmmm. Maybe the approach I suggest would only work for 1/2 compression, > but not for other target ratios, but I thin
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
> My question is whether it should be added as an optional facility of a > parameterised sub plan, rather than an always-needed full-strength node. > That way the choice of whether to use it can happen at execution time once > we notice that we've been called too many times.

Actually, I am not sure what "parameterized sub plan" means here (I read it as a SubPlan node), so please correct me if I misunderstand you :) The inner plan in a nested loop is not actually a SubPlan node, so if we bind the facility to the SubPlan node, we may lose the chance to use it for nested loops. And when we consider the usage for nested loops, the example below shows where this feature will be more powerful.

select j1o.i, j2_v.sum_5 from j1 j1o inner join lateral (select im100, sum(im5) as sum_5 from j2 where j1o.im100 = im100 and j1o.i = 1 group by im100) j2_v on true where j1o.i = 1;

-- Best Regards Andy Fan
Re: Parallel Seq Scan vs kernel read ahead
Hi Thomas, Some more data points:

create table t_heap as select generate_series(1, 1) i;

Query: select count(*) from t_heap;

shared_buffers=32MB (so that I don't have to clear buffers, OS page cache)
OS: FreeBSD 12.1 with UFS on GCP
4 vCPUs, 4GB RAM Intel Skylake
22G Google PersistentDisk

Time is measured with \timing on.

Without your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  33.88s
1                                  57.62s
2                                  62.01s
6                                  222.94s

With your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  29.04s
1                                  29.17s
2                                  28.78s
6                                  291.27s

I checked with explain analyze to ensure that the number of workers planned = max_parallel_workers_per_gather

Apart from the last result (max_parallel_workers_per_gather=6), all the other results seem favorable. Could the last result be down to the fact that the number of workers planned exceeded the number of vCPUs?

I also wanted to evaluate Zedstore with your patch. I used the same setup as above. No discernible difference though, maybe I'm missing something:

Without your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  25.86s
1                                  15.70s
2                                  12.60s
6                                  12.41s

With your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  26.96s
1                                  15.73s
2                                  12.46s
6                                  12.10s

-- Soumyadeep

On Thu, May 21, 2020 at 3:28 PM Thomas Munro wrote: > On Fri, May 22, 2020 at 10:00 AM David Rowley > wrote: > > On Thu, 21 May 2020 at 17:06, David Rowley wrote: > > > For the patch. I know you just put it together quickly, but I don't > > > think you can do that ramp up the way you have. It looks like there's > > > a risk of torn reads and torn writes and I'm unsure how much that > > > could affect the test results here. > > > > Oops. On closer inspection, I see that memory is per worker, not > > global to the scan. > > Right, I think it's safe. I think you were probably right that > ramp-up isn't actually useful though, it's only the end of the scan > that requires special treatment so we don't get unfair allocation as > the work runs out, due to the coarse grain. I suppose that even if you > have a scheme that falls back to fine grained allocation for the final > N pages, it's still possible that a highly distracted process (most > likely the leader given its double duties) can finish up sitting on a > large range of pages and eventually have to process them all at the > end after the other workers have already knocked off and gone for a > pint. > > >
Re: Parallel Seq Scan vs kernel read ahead
On Fri, May 22, 2020 at 1:14 PM Soumyadeep Chakraborty wrote: > Some more data points: Thanks! > max_parallel_workers_per_gatherTime(seconds) > 0 29.04s > 1 29.17s > 2 28.78s > 6 291.27s > > I checked with explain analyze to ensure that the number of workers > planned = max_parallel_workers_per_gather > > Apart from the last result (max_parallel_workers_per_gather=6), all > the other results seem favorable. > Could the last result be down to the fact that the number of workers > planned exceeded the number of vCPUs? Interesting. I guess it has to do with patterns emerging from various parameters like that magic number 64 I hard coded into the test patch, and other unknowns in your storage stack. I see a small drop off that I can't explain yet, but not that. > I also wanted to evaluate Zedstore with your patch. > I used the same setup as above. > No discernible difference though, maybe I'm missing something: It doesn't look like it's using table_block_parallelscan_nextpage() as a block allocator so it's not affected by the patch. It has its own thing zs_parallelscan_nextrange(), which does pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids, ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x10.
More tests with USING INDEX replident and dropped indexes
Hi all, While working on some other logical decoding patch recently, I bumped into the fact that we have special handling for the case of REPLICA IDENTITY USING INDEX when the dependent index is dropped, where the code handles that case as an equivalent of NOTHING. Attached is a patch to add more coverage for that. Any thoughts? -- Michael diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index d79cd316b7..fb4a7d7689 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -565,6 +565,36 @@ UPDATE table_with_unique_not_null SET data = 3 WHERE data = 2; UPDATE table_with_unique_not_null SET id = -id; UPDATE table_with_unique_not_null SET id = -id; DELETE FROM table_with_unique_not_null WHERE data = 3; +-- check tables with dropped indexes used in REPLICA IDENTITY +-- with primary key +CREATE TABLE table_dropped_index_with_pk (a int PRIMARY KEY, b int, c int); +CREATE UNIQUE INDEX table_dropped_index_with_pk_idx + ON table_dropped_index_with_pk(a); +ALTER TABLE table_dropped_index_with_pk REPLICA IDENTITY + USING INDEX table_dropped_index_with_pk_idx; +DROP INDEX table_dropped_index_with_pk_idx; +INSERT INTO table_dropped_index_with_pk VALUES (1,1,1), (2,2,2), (3,3,3); +UPDATE table_dropped_index_with_pk SET a = 4 WHERE a = 1; +UPDATE table_dropped_index_with_pk SET b = 5 WHERE a = 2; +UPDATE table_dropped_index_with_pk SET b = 6, c = 7 WHERE a = 3; +DELETE FROM table_dropped_index_with_pk WHERE b = 1; +DELETE FROM table_dropped_index_with_pk WHERE a = 3; +DROP TABLE table_dropped_index_with_pk; +-- without primary key +CREATE TABLE table_dropped_index_no_pk (a int NOT NULL, b int, c int); +CREATE UNIQUE INDEX table_dropped_index_no_pk_idx + ON table_dropped_index_no_pk(a); +ALTER TABLE table_dropped_index_no_pk REPLICA IDENTITY + USING INDEX table_dropped_index_no_pk_idx; +INSERT INTO table_dropped_index_no_pk VALUES (1,1,1), (2,2,2), (3,3,3); +DROP INDEX table_dropped_index_no_pk_idx; +INSERT INTO table_dropped_index_no_pk VALUES (1,1,1), (2,2,2), (3,3,3); +UPDATE table_dropped_index_no_pk SET a = 4 WHERE a = 1; +UPDATE table_dropped_index_no_pk SET b = 5 WHERE a = 2; +UPDATE table_dropped_index_no_pk SET b = 6, c = 7 WHERE a = 3; +DELETE FROM table_dropped_index_no_pk WHERE b = 1; +DELETE FROM table_dropped_index_no_pk WHERE a = 3; +DROP TABLE table_dropped_index_no_pk; -- check toast support BEGIN; CREATE SEQUENCE toasttable_rand_seq START 79 INCREMENT 1499; -- portable "random" @@ -682,6 +712,56 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc table public.table_with_unique_not_null: DELETE: id[integer]:4 COMMIT BEGIN + table public.table_dropped_index_with_pk: INSERT: a[integer]:1 b[integer]:1 c[integer]:1 + table public.table_dropped_index_with_pk: INSERT: a[integer]:2 b[integer]:2 c[integer]:2 + table public.table_dropped_index_with_pk: INSERT: a[integer]:3 b[integer]:3 c[integer]:3 + COMMIT + BEGIN + table public.table_dropped_index_with_pk: UPDATE: a[integer]:4 b[integer]:1 c[integer]:1 + COMMIT + BEGIN + table public.table_dropped_index_with_pk: UPDATE: a[integer]:2 b[integer]:5 c[integer]:2 + COMMIT + BEGIN + table public.table_dropped_index_with_pk: UPDATE: a[integer]:3 b[integer]:6 c[integer]:7 + COMMIT + BEGIN + table public.table_dropped_index_with_pk: DELETE: (no-tuple-data) + COMMIT + BEGIN + table public.table_dropped_index_with_pk: DELETE: (no-tuple-data) + COMMIT + BEGIN + table public.table_dropped_index_no_pk: INSERT: a[integer]:1 b[integer]:1 
c[integer]:1 + table public.table_dropped_index_no_pk: INSERT: a[integer]:2 b[integer]:2 c[integer]:2 + table public.table_dropped_index_no_pk: INSERT: a[integer]:3 b[integer]:3 c[integer]:3 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: INSERT: a[integer]:1 b[integer]:1 c[integer]:1 + table public.table_dropped_index_no_pk: INSERT: a[integer]:2 b[integer]:2 c[integer]:2 + table public.table_dropped_index_no_pk: INSERT: a[integer]:3 b[integer]:3 c[integer]:3 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: UPDATE: a[integer]:4 b[integer]:1 c[integer]:1 + table public.table_dropped_index_no_pk: UPDATE: a[integer]:4 b[integer]:1 c[integer]:1 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: UPDATE: a[integer]:2 b[integer]:5 c[integer]:2 + table public.table_dropped_index_no_pk: UPDATE: a[integer]:2 b[integer]:5 c[integer]:2 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: UPDATE: a[integer]:3 b[integer]:6 c[integer]:7 + table public.table_dropped_index_no_pk: UPDATE: a[integer]:3 b[integer]:6 c[integer]:7 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: DELETE: (no-tuple-data) + table public.table_dropped_index_no_pk: DELETE: (no-tuple-data) + COMMIT + BEGIN + table public.table_dropped_index_no_pk: DELETE: (no-tuple-data) + table public.table_dropped_index_no_pk: DELETE: (no-tuple-data) + COMMIT + BEGIN table public.toasttable: INSERT: id[integer]:1 toasted_
Re: race condition when writing pg_control
On Tue, May 5, 2020 at 9:51 AM Thomas Munro wrote: > On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan wrote: > > I believe I've discovered a race condition between the startup and > > checkpointer processes that can cause a CRC mismatch in the pg_control > > file. If a cluster crashes at the right time, the following error > > appears when you attempt to restart it: > > > > FATAL: incorrect checksum in control file > > > > This appears to be caused by some code paths in xlog_redo() that > > update ControlFile without taking the ControlFileLock. The attached > > patch seems to be sufficient to prevent the CRC mismatch in the > > control file, but perhaps this is a symptom of a bigger problem with > > concurrent modifications of ControlFile->checkPointCopy.nextFullXid. > > This does indeed look pretty dodgy. CreateRestartPoint() running in > the checkpointer does UpdateControlFile() to compute a checksum and > write it out, but xlog_redo() processing > XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without > interlocking. It looks like the ancestors of that line were there > since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran > UpdateControLFile() directly in the startup process (immediately after > that update), so no interlocking problem. Then in cdd46c76548 (2009), > RecoveryRestartPoint() was split up so that CreateRestartPoint() ran > in another process. Here's a version with a commit message added. I'll push this to all releases in a day or two if there are no objections. From 28ee04b537085148c75f575d2bc755a1fffc19c7 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 22 May 2020 16:23:59 +1200 Subject: [PATCH] Fix race condition that could corrupt pg_control. The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire ControlFileLock before modifying ControlFile->checkPointCopy, or the checkpointer could write out a control file with a bad checksum. Back-patch to all supported release. Author: Nathan Bossart Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com --- src/backend/access/transam/xlog.c | 4 1 file changed, 4 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ca09d81b08..274b808476 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9976,7 +9976,9 @@ xlog_redo(XLogReaderState *record) } /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid; + LWLockRelease(ControlFileLock); /* Update shared-memory copy of checkpoint XID/epoch */ SpinLockAcquire(&XLogCtl->info_lck); @@ -10033,7 +10035,9 @@ xlog_redo(XLogReaderState *record) SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB); /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid; + LWLockRelease(ControlFileLock); /* Update shared-memory copy of checkpoint XID/epoch */ SpinLockAcquire(&XLogCtl->info_lck); -- 2.20.1
Re: POC: rational number type (fractions)
On Thu, May 21, 2020 at 01:40:10PM -0400, Robert Haas wrote: > On Mon, May 18, 2020 at 6:15 PM Tom Lane wrote: > > There surely are use-cases for true rational arithmetic, but I'm > > dubious that it belongs in core Postgres. I don't think that enough > > of our users would want it to justify expending core-project maintenance > > effort on it. So I'd be happier to see this as an out-of-core extension. > > As is often the case, I'm a little more positive about including this > than Tom, but as is also often the case, I'm somewhat cautious, too. > On the one hand, I think it would be cool to have and people would > like it. But, On the other hand, I also think we'd only want it if > we're convinced that it's a really good implementation and that > there's not a competing design which is better, or even equally good. I vote for keeping it out of core, mostly because writing rational numeric code is so different from writing DBMS core code. (Many of our existing types, like numeric and the geometric types, have the same problem. Let's not invite more of that.) The optimal reviewer pools won't have much overlap, so patches may sit awhile and/or settle for a cursory review. More language standard libraries provide "numeric"-style big decimals[1] than provide big rationals[2], suggesting we're in good company. [1] https://en.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software#Languages [2] https://en.wikipedia.org/wiki/Rational_data_type#Language_support
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, May 21, 2020 at 3:17 PM Michael Paquier wrote: > On Thu, May 21, 2020 at 08:49:53AM +0200, Julien Rouhaud wrote: > > On Tue, May 19, 2020 at 4:29 AM Andy Fan > wrote: > >> Thanks for the excellent extension. I want to add 5 more fields to > satisfy the > >> following requirements. > >> > >> int subplan; /* No. of subplan in this query */ > >> int subquery; /* No. of subquery */ > >> int joincnt; /* How many relations are joined */ > >> bool hasagg; /* if we have agg function in this query */ > >> bool hasgroup; /* has group clause */ > > > > Most of those fields can be computed using the raw sql satements. Why > > not adding functions like query_has_agg(querytext) to get the > > information from pgss stored query text instead? > > Yeah I personally find concepts related only to the query string > itself not something that needs to be tied to pg_stat_statements. > While reading about those five new fields, I am also wondering how > this stuff would work with CTEs. Particularly, should the hasagg or > hasgroup flags be set only if the most outer query satisfies a > condition? What if an inner query satisfies a condition but not an outer query? Should joincnt just be the sum of all the joins done in > all queries, including subqueries? >

The semantics are for the overall query, not only the outermost query; see code like this, for example:

query_characters.hasagg |= parse->hasAggs;
query_characters.hasgroup |= parse->groupClause != NIL;

> Most of those fields can be computed using the raw sql satements. Why > not adding functions like query_has_agg(querytext) to get the > information from pgss stored query text instead?

That is mainly because I don't want to reparse the query again.

-- Best Regards Andy Fan
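To make the "overall query" semantics concrete with respect to Michael's CTE/subquery question, here is a rough sketch: collect_query_characters() and the QueryCharacters struct are invented for illustration, and only the |= pattern above comes from the actual proposal. It simply ORs the flags across all query levels.

typedef struct QueryCharacters
{
	bool		hasagg;
	bool		hasgroup;
} QueryCharacters;

static void
collect_query_characters(Query *parse, QueryCharacters *qc)
{
	ListCell   *lc;

	qc->hasagg |= parse->hasAggs;
	qc->hasgroup |= (parse->groupClause != NIL);

	/* recurse so that inner queries count too; CTEs would need the same via cteList */
	foreach(lc, parse->rtable)
	{
		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);

		if (rte->rtekind == RTE_SUBQUERY)
			collect_query_characters(rte->subquery, qc);
	}
}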
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, May 21, 2020 at 3:49 PM Julien Rouhaud wrote: > Le jeu. 21 mai 2020 à 09:17, Michael Paquier a > écrit : > >> On Thu, May 21, 2020 at 08:49:53AM +0200, Julien Rouhaud wrote: >> > On Tue, May 19, 2020 at 4:29 AM Andy Fan >> wrote: >> >> Thanks for the excellent extension. I want to add 5 more fields to >> satisfy the >> >> following requirements. >> >> >> >> int subplan; /* No. of subplan in this query */ >> >> int subquery; /* No. of subquery */ >> >> int joincnt; /* How many relations are joined */ >> >> bool hasagg; /* if we have agg function in this query */ >> >> bool hasgroup; /* has group clause */ >> > >> > Most of those fields can be computed using the raw sql satements. Why >> > not adding functions like query_has_agg(querytext) to get the >> > information from pgss stored query text instead? >> >> Yeah I personally find concepts related only to the query string >> itself not something that needs to be tied to pg_stat_statements. >> ... >> > > Indeed cte will bring additional concerns about the fields semantics. > That's another good reason to go with external functions so you can add > extra parameters for that if needed. > >>

There are also some things we can't easily get from the query string, such as:

1. which views are involved;
2. subqueries that are pulled up, so there is in fact no subquery left;
3. sublinks that are pulled up or become an InitPlan rather than a SubPlan;
4. joins that are removed by remove_useless_joins.

-- Best Regards Andy Fan
Re: [Proposal] Page Compression for OLTP
Sorry, There may be a problem with the display format of the previous mail. So resend it At 2020-05-21 15:04:55, "Fabien COELHO" wrote: > >Hello, > >My 0.02, some of which may just show some misunderstanding on my part: > > - Could this be proposed as some kind of extension, provided that enough >hooks are available? ISTM that foreign tables and/or alternative >storage engine (aka ACCESS METHOD) provide convenient APIs which could >fit the need for these? Or are they not appropriate? You seem to >suggest that there are not. > >If not, what could be done to improve API to allow what you are seeking >to do? Maybe you need a somehow lower-level programmable API which does >not exist already, or at least is not exported already, but could be >specified and implemented with limited effort? Basically you would like >to read/write pg pages to somewhere, and then there is the syncing >issue to consider. Maybe such a "page storage" API could provide >benefit for some specialized hardware, eg persistent memory stores, >so there would be more reason to define it anyway? I think it might >be valuable to give it some thoughts. Thank you for giving so many comments. In my opinion, developing a foreign table or a new storage engine, in addition to compression, also needs to do a lot of extra things. A similar explanation was mentioned in Nikolay P's email. The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it. > - Could you maybe elaborate on how your plan differs from [4] and [5]? My solution is similar to CFS, and it is also embedded in the file access layer (fd.c, md.c) to realize the mapping from block number to the corresponding file and location where compressed data is stored. However, the most important difference is that I hope to avoid the need for GC through the design of the page layout. https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net >> The most difficult thing in CFS development is certainly >> defragmentation. In CFS it is done using background garbage collection, >> by one or one >> GC worker processes. The main challenges were to minimize its >> interaction with normal work of the system, make it fault tolerant and >> prevent unlimited growth of data segments. >> CFS is not introducing its own storage manager, it is mostly embedded in >> existed Postgres file access layer (fd.c, md.c). It allows to reused >> code responsible for mapping relations and file descriptors cache. As it >> was recently discussed in hackers, it may be good idea to separate the >> questions "how to map blocks to filenames and offsets" and "how to >> actually perform IO". In this it will be easier to implement compressed >> storage manager. > - Have you consider keeping page headers and compressing tuple data >only? In that case, we must add some additional information in the page header to identify whether this is a compressed page or an uncompressed page. When a compressed page becomes an uncompressed page, or vice versa, an uncompressed page becomes a compressed page, the original page header must be modified. This is unacceptable because it requires modifying the shared buffer and recalculating the checksum. However, it should be feasible to put this flag in the compressed address file. The problem with this is that even if a page only occupies the size of one compressed block, the address file needs to be read, that is, from 1 IO to 2 IO. 
Since the address file is very small, it is basically a memory access, this cost may not be as large as I had imagined. > - I'm not sure there is a point in going below the underlying file >system blocksize, quite often 4 KiB? Or maybe yes? Or is there >a benefit to aim at 1/4 even if most pages overflow? My solution is mainly optimized for scenarios where the original page can be compressed to only require one compressed block of storage. The scene where the original page is stored in multiple compressed blocks is suitable for scenarios that are not particularly sensitive to performance, but are more concerned about the compression rate, such as cold data. In addition, users can also choose to compile PostgreSQL with 16KB or 32KB BLOCKSZ. > - ISTM that your approach entails 3 "files". Could it be done with 2? >I'd suggest that the possible overflow pointers (coa) could be part of >the headers so that when reading the 3.1 page, then the header would >tell where to find the overflow 3.2, without requiring an additional >independent structure with very small data in it, most of it zeros. >Possibly this is not possible, because it would require some available >space in standard headers when the is page is not compressible, and >there is not enough.
Re: Problem with pg_atomic_compare_exchange_u64 at 32-bit platforms
On Wed, May 20, 2020 at 10:59:44AM +0300, Konstantin Knizhnik wrote: > On 20.05.2020 10:36, Noah Misch wrote: > >On Wed, May 20, 2020 at 10:23:37AM +0300, Konstantin Knizhnik wrote: > >>On 20.05.2020 06:05, Noah Misch wrote: > >>>On Tue, May 19, 2020 at 04:07:29PM +0300, Konstantin Knizhnik wrote: > Definition of pg_atomic_compare_exchange_u64 requires alignment of > expected > pointer on 8-byte boundary. > > pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr, > uint64 *expected, uint64 newval) > { > #ifndef PG_HAVE_ATOMIC_U64_SIMULATION > AssertPointerAlignment(ptr, 8); > AssertPointerAlignment(expected, 8); > #endif > > > I wonder if there are platforms where such restriction is actually > needed. > >>>In general, sparc Linux does SIGBUS on unaligned access. Other platforms > >>>function but suffer performance penalties. > >>Well, if platform enforces strict alignment, then addressed value should be > >>properly aligned in any case, shouldn't it? > >No. One can always cast a char* to a uint64* and get a misaligned read when > >dereferencing the resulting pointer. > > Yes, certainly we can "fool" compiler using type casts: > > char buf[8]; > *(int64_t*)buf = 1; PostgreSQL does things like that, so the assertions aren't frivolous. > But I am speaking about normal (safe) access to variables: > > long long x; > > In this case "x" compiler enforces proper alignment of "x" for the target > platform. > We are not adding AssertPointerAlignment to any function which has pointer > arguments, aren' we? Most functions don't have such assertions. That doesn't make it wrong for this function to have them. > I understand we do we require struct alignment pointer to atomic variables > even at the platforms which do not require it > (as Andreas explained, if value cross cacheline, it will cause significant > slowdown). > But my question was whether we need string alignment of expected value? I expect at least some platforms need strict alignment, though I haven't tried to prove it. > >>void f() { > >> int x; > >> long long y; > >> printf("%p\n", &y); > >>} > >> > >>then its address must not be aligned on 8 at 32-bit platform. > >>This is why "expected" in test_atomic_uint64 may not be aligned on 8-byte > >>boundary and we can get assertion failure. > >Can you construct a patch that adds some automatic variables to a regress.c > >function and causes an assertion inside pg_atomic_compare_exchange_u64() to > >fail on some machine you have? I don't think win32 behaves as you say. If > >you can make a test actually fail using the technique you describe, that > >would > >remove all doubt. > I do not have access to Win32. > But I think that if you just add some 4-byte variable before "expected" > definition, then you will get this assertion failure (proposed patch is > attached). > Please notice that PG_HAVE_ATOMIC_U64_SIMULATION should not be defined and > Postgres is build with --enable-cassert and CLAGS=-O0 > > Also please notice that my report is not caused just by hypothetical problem > which I found out looking at Postgres code. > We actually get this assertion failure in pg_atomic_compare_exchange_u64 at > Win32 (not in regress.c). Given https://postgr.es/m/flat/20150108204635.GK6299%40alap3.anarazel.de was necessary, that is plausible. Again, if you can provide a test case that you have confirmed reproduces it, that will remove all doubt. You refer to a "we" that has access to a system that reproduces it.
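For anyone wanting to try this, here is a sketch of the kind of regress.c test being proposed -- the function and variable names are illustrative, not the attached patch. Declaring a 4-byte local ahead of "expected" means that, on a 32-bit ABI that only guarantees 4-byte stack alignment for 8-byte integers, the pointer passed in may trip the alignment assertion.

static void
test_atomic_uint64_alignment(pg_atomic_uint64 *var)
{
	int32		filler = 0;		/* 4-byte local deliberately declared first */
	uint64		expected = 0;

	(void) filler;

	/*
	 * On i386-style ABIs, "expected" may end up only 4-byte aligned, so this
	 * can fail AssertPointerAlignment(expected, 8) in an assert-enabled
	 * build (assuming PG_HAVE_ATOMIC_U64_SIMULATION is not defined).
	 */
	(void) pg_atomic_compare_exchange_u64(var, &expected, 1);
}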
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, May 19, 2020 at 6:01 PM Amit Kapila wrote: > > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar wrote: > > I have further reviewed v22 and below are my comments: v22-0005-Implement-streaming-mode-in-ReorderBuffer -- 1. + * Note: We never do both stream and serialize a transaction (we only spill + * to disk when streaming is not supported by the plugin), so only one of + * those two flags may be set at any given time. + */ +#define rbtxn_is_streamed(txn) \ +( \ + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \ +) The above 'Note' is not correct as per the latest implementation. v22-0006-Add-support-for-streaming-to-built-in-replicatio 2. --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -14,7 +14,6 @@ * *- */ - #include "postgres.h" Spurious line removal. 3. +void +logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn, +XLogRecPtr commit_lsn) +{ + uint8 flags = 0; + + pq_sendbyte(out, 'c'); /* action STREAM COMMIT */ + + Assert(TransactionIdIsValid(txn->xid)); + + /* transaction ID (we're starting to stream, so must be valid) */ + pq_sendint32(out, txn->xid); The part of the comment "we're starting to stream, so must be valid" is not correct as we are not at the start of the stream here. The patch has used the same incorrect sentence at few places, kindly fix those as well. 4. + * XXX Do we need to allocate it in TopMemoryContext? + */ +static void +subxact_info_add(TransactionId xid) { .. For this and other places in a patch like in function stream_open_file(), instead of using TopMemoryContext, can we consider using a new memory context LogicalStreamingContext or something like that. We can create LogicalStreamingContext under TopMemoryContext. I don't see any need of using TopMemoryContext here. 5. +static void +subxact_info_add(TransactionId xid) This function has assumed a valid value for global variables like stream_fd and stream_xid. I think it is better to have Assert for those in this function before using them. The Assert for those are present in handle_streamed_transaction but I feel they should be in subxact_info_add. 6. +subxact_info_add(TransactionId xid) /* + * In most cases we're checking the same subxact as we've already seen in + * the last call, so make ure just ignore it (this change comes later). + */ + if (subxact_last == xid) + return; Typo and minor correction, /ure just/sure to 7. +subxact_info_write(Oid subid, TransactionId xid) { .. + /* + * But we free the memory allocated for subxact info. There might be one + * exceptional transaction with many subxacts, and we don't want to keep + * the memory allocated forewer. + * + */ a. Typo, /forewer/forever b. The extra line at the end of the comment is not required. 8. + * XXX Maybe we should only include the checksum when the cluster is + * initialized with checksums? + */ +static void +subxact_info_write(Oid subid, TransactionId xid) Do we really need to have the checksum for temporary files? I have checked a few other similar cases like SharedFileSet stuff for parallel hash join but didn't find them using checksums. Can you also once see other usages of temporary files and then let us decide if we see any reason to have checksums for this? Another point is we don't seem to be doing this for 'changes' file, see stream_write_change. So, not sure, there is any sense to write checksum for subxact file. Tomas, do you see any reason for the same? 9. 
+subxact_filename(char *path, Oid subid, TransactionId xid) +{ + char tempdirpath[MAXPGPATH]; + + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID); + + /* + * We might need to create the tablespace's tempfile directory, if no + * one has yet done so. + */ + if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + tempdirpath))); + + snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts", + tempdirpath, subid, xid); +} Temporary files created in PGDATA/base/pgsql_tmp follow a certain naming convention (see docs[1]) which is not followed here. You can also refer SharedFileSetPath and OpenTemporaryFile. I think we can just try to follow that convention and then additionally append subid, xid and .subxacts. Also, a similar change is required for changes_filename. I would like to know if there is a reason why we want to use different naming convention here? 10. + * This can only be called at the beginning of a "streaming" block, i.e. + * between stream_start/stream_stop messages from the upstream. + */ +static void +stream_close_file(void) The comment seems to be wrong. I think this can be only called at stream end, so it should be "This can only be called at the end
Re: Optimizer docs typos
On Wed, May 20, 2020 at 7:17 PM Etsuro Fujita wrote: > On Tue, May 19, 2020 at 7:35 PM Etsuro Fujita wrote: > > On Mon, May 18, 2020 at 7:45 PM Richard Guo wrote: > > > In this same README doc, another suspicious typo to me, which happens in > > > section "Optimizer Functions", is in the prefix to query_planner(), > > > we should have three dashes, rather than two, since query_planner() is > > > called within grouping_planner(). > > > > > > diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README > > > index 7dcab9a..bace081 100644 > > > --- a/src/backend/optimizer/README > > > +++ b/src/backend/optimizer/README > > > @@ -315,7 +315,7 @@ set up for recursive handling of subqueries > > >preprocess target list for non-SELECT queries > > >handle UNION/INTERSECT/EXCEPT, GROUP BY, HAVING, aggregates, > > > ORDER BY, DISTINCT, LIMIT > > > ---query_planner() > > > +---query_planner() > > > make list of base relations used in query > > > split up the qual into restrictions (a=1) and joins (b=c) > > > find qual clauses that enable merge and hash joins > > > > Yeah, you are right. Another one would be in the prefix to > > standard_join_search(); I think it might be better to have six dashes, > > rather than five, because standard_join_search() is called within > > make_rel_from_joinlist(). > > Here is a patch including the change I proposed. (Yet another thing I > noticed is the indent spaces for join_search_one_level(): that > function is called within standard_join_search(), so it would be > better to have one extra space, for consistency with others (eg, > set_base_rel_pathlists() called from make_one_rel()), but that would > be too nitpicking.) This is more like an improvement, so I'll apply > the patch to HEAD only, if no objestions. Done. Best regards, Etsuro Fujita